Maksims Gavrilovs

Posted on Jun 8 • Edited on Jun 12

Zero to Autopilot, Part 3: Giving a Still Image Real Motion for $0.00

#ffmpeg #python #ai #video

Series: Zero to Autopilot — Building a Self-Improving AI Media Channel. Part 3 of 7. Part 1 was the landscape and my $10 wake-up call; Part 2 was the 7-stage pipeline. This one is the engineering centerpiece: replacing paid AI video with free motion.

Data status: real-now — real ffmpeg filtergraphs from the repo. Every effect here is playing in the live gallery (dasein108.github.io/slope-studio); code is open source.

Viewers don't need generated video. They need motion.

The recap from Part 1 is one line of arithmetic: hosted AI image-to-video bills per second — kling at $0.07/s makes a 150-second Short cost about $10.50. Fine for a single hero shot; absurd as the default for every scene when the whole strategy depends on making hundreds of cheap experiments.
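As a quick sketch of that arithmetic (the function name is mine, not something from the repo; the rate is the figure quoted above, and the 6-second hero clip is the length used later in this article):

```python
def clip_cost(rate_per_second: float, seconds: float) -> float:
    """Cost of a hosted image-to-video render billed per second of output."""
    return rate_per_second * seconds


KLING_RATE = 0.07  # USD per second of generated video, as quoted in the text

full_short = clip_cost(KLING_RATE, 150)  # generating every second of a 150 s Short
hero_shot = clip_cost(KLING_RATE, 6)     # generating one 6 s hero clip only

print(f"full Short: ${full_short:.2f}, hero shot: ${hero_shot:.2f}")
```

The gap between those two numbers is the whole argument for spending on a hero shot and faking the rest.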

But viewers were never asking for generated video. They want the feeling of motion: a still that drifts, breathes, and cuts on the beat holds attention perfectly well. I'd internalized this years ago shipping indie games, where the entire craft is faking expensive things with cheap math — no budget for a particle artist, so you write a particle system; no budget for animation, so you parallax-scroll a few layers and call it atmosphere. The same instinct ports straight to AI media. Everything below is one still image, ffmpeg, and zero dollars.

ffmpeg is the whole trick: an effect is a string

The quiet hero here is ffmpeg. It ships with roughly 400 built-in filters, and an "effect" is just a few of them chained with commas — no render engine, no GPU shaders, no SDK, no per-call cost. One binary you already have. Every motion in this series is an ffmpeg filtergraph, which means adding an effect is adding a string.

Here is the entire implementation of oldfilm, the vintage look:

"[0:v]colorchannelmixer=.393:.769:.189:0:.349:.686:.168:0:.272:.534:.131,"  # → sepia
"eq=contrast=1.12:saturation=0.82:brightness='0.035*sin(27*t)+0.025*sin(11*t)':eval=frame,"  # flicker; eval=frame re-evaluates t on every frame (eq's default evaluates once at init)
"noise=alls=22:allf=t,"     # film grain, re-rolled every frame
"vignette=PI/4[v]"          # darkened corners

Read it like a Unix pipe; each comma is "then":

colorchannelmixer — a 3×3 RGB matrix that maps the image to a sepia tone.
eq=…brightness='…sin(t)…' — t is the frame's timestamp, so brightness wobbles over time: the projector-gate flicker. Time expressions are what make an effect animate — sin(t) here, a creeping zoom in Ken-Burns next.
noise=allf=t — the t (temporal) flag re-randomizes the grain every frame, so it shimmers instead of sitting frozen.
vignette=PI/4 — darken the corners.

Four stock filters, one string, and it moves. A glitch is rgbashift + noise; chromatic aberration is just rgbashift; rain is a particle layer composited with overlay. The reason this channel can afford hundreds of videos isn't a cheaper model — it's that the effect budget is a text editor and ffmpeg -filter_complex.
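To make "adding an effect is adding a string" concrete, here is a minimal sketch of how such a string could be assembled and handed to ffmpeg. The helper names, file names, and the specific rgbashift/noise values are illustrative assumptions, not the repo's code; the command is only built and printed, not executed.

```python
import shlex


def chain(*filters: str) -> str:
    """Join filters into one linear chain; in ffmpeg syntax a comma means 'then'."""
    return ",".join(f.strip() for f in filters if f.strip())


def ffmpeg_still_cmd(image: str, dst: str, seconds: float, graph: str,
                     fps: int = 30) -> list[str]:
    """Build (not run) an ffmpeg command that loops one still through a filter chain."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-framerate", str(fps), "-i", image,  # repeat the still as a video stream
        "-filter_complex", f"[0:v]{graph}[v]",
        "-map", "[v]", "-t", str(seconds),                   # stop after `seconds` of output
        "-pix_fmt", "yuv420p", dst,
    ]


# Hypothetical "glitch" look: shift the red and blue planes apart, add per-frame noise.
glitch = chain("rgbashift=rh=-6:bh=6", "noise=alls=30:allf=t")
cmd = ffmpeg_still_cmd("scene.png", "scene.mp4", 4, glitch)
print(shlex.join(cmd))
```

Swapping the look means swapping the arguments to chain(); nothing else in the command changes.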

The effect families

That one binary buys a whole vocabulary. The catalog sorts into a handful of families, each answering a different question — what does this scene need?

Camera motion — kenburns, motion-drift{left,right,up,down}, motion-zoom{in,out}, pulse. The cheapest possible life: a still pans, drifts, breathes. The default for most scenes.
Depth — parallax, blurred-parallax. Real 2.5D: the foreground subject holds still while the background drifts behind it. For scenery with a clear subject.
Kinetic type — kinetic. Emphasis: a headline slides in over the shot. For the hook or a key stat, not every scene.
Atmosphere — rain, snow, fog, embers, blood, petals, leaves, wind. Mood and a sense of place — the emotional weather, composited for free.
Colour & look grades — grain, vignette, oldfilm, sunrise, sunset, godrays, chroma. Tone and era. This family does the most to separate intentional from slop: grain and a vignette alone (the cover image is one still run through six of these) read as "graded by someone who cares."
Impact — flash[-white/-yellow/-red/-black], blood. A 2–3 frame punch for an action beat. Rare by design.
Characters — puppet (a cutout figure that hops or nods), talkinghead (Rhubarb lip-sync). A figure that acts or speaks, with no avatar model.
Vector — manim. Literal concept and maths visualization, 3Blue1Brown-style. The education power tool (and the one I haven't tamed — more below).
Transitions — cut, fade, dissolve, wipeleft, slideup, slice. Rhythm: how one scene becomes the next.

They're all the same idea underneath — a filtergraph string — so the rest of this piece takes apart the three most interesting ones.

How the motion is wired

Each scene names an animator, and one dispatch function routes to the implementation. The important property is the fallback, which is not shown in the routing excerpt here: anything that fails falls back to Ken-Burns and records why in the manifest, so a missing optional dependency degrades the look instead of breaking the render.

# studio/animate.py
a = (animator or "kenburns").strip()
if a == "kenburns" or a == "":   ffmpeg.ken_burns(image, dst, seconds)
elif a.startswith("motion-"):    ffmpeg.motion(image, dst, seconds, preset=a.split("-", 1)[1])
elif a == "kinetic":             return _kinetic(scene, image, dst, seconds)
elif a == "parallax":            return _parallax(scene, image, dst, seconds)
elif a == "slice":               return _slice(scene, image, dst, seconds)
elif a == "puppet":              return _puppet(scene, image, dst, seconds)
elif a == "talkinghead":         return _talkinghead(scene, image, dst, seconds, audio)
elif a == "manim":               return _manim(scene, dst, seconds)
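The fallback behaviour described above could look roughly like the sketch below. This is an assumption-laden illustration, not the repo's dispatch: the registry dict, the stand-in animator functions, and the manifest fields are all hypothetical.

```python
from typing import Callable


def ken_burns(scene: dict) -> str:
    return "kenburns"  # stand-in for the real render call


def parallax(scene: dict) -> str:
    raise RuntimeError("rembg is not installed")  # simulate a missing optional dependency


# Hypothetical registry; the real module routes with an if/elif chain.
ANIMATORS: dict[str, Callable[[dict], str]] = {"kenburns": ken_burns, "parallax": parallax}


def animate(scene: dict, manifest: list[dict]) -> str:
    """Route to the requested animator; on any failure, fall back to Ken-Burns and log why."""
    name = (scene.get("animator") or "kenburns").strip() or "kenburns"
    try:
        return ANIMATORS[name](scene)  # an unknown name raises KeyError, caught below
    except Exception as exc:
        manifest.append({
            "scene": scene.get("id"),
            "wanted": name,
            "used": "kenburns",
            "reason": f"{type(exc).__name__}: {exc}",
        })
        return ken_burns(scene)


manifest: list[dict] = []
used = animate({"id": 3, "animator": "parallax"}, manifest)
print(used, manifest)
```

The point of recording the reason is that a degraded render stays debuggable: the video still ships, and the manifest says which scene wanted what and why it did not get it.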

The workhorse, Ken-Burns, is a single zoompan expression — over-scale the source 2× first so the crop never reaches an edge:

# studio/ffmpeg.py — ken_burns()
vf = (f"crop={w*2}:{h*2},"
      f"zoompan=z='min(zoom+0.0012,1.12)':d={frames}:s={w}x{h}:fps={fps}:"
      f"x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'")

z='min(zoom+0.0012,1.12)' creeps the zoom in a hair per frame, capped at 1.12×. The motion-* presets are the same machine with different z/x/y expressions — a whole family of movement from one filtergraph.
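A quick way to reason about that expression is to compute how long the creep lasts before the cap takes over. The sketch below uses the closed form of the per-frame update; the 30 fps frame rate is an assumption, since the excerpt leaves fps as a variable.

```python
import math


def zoom_after(steps: int, step: float = 0.0012, cap: float = 1.12) -> float:
    """Zoom factor after `steps` applications of z = min(zoom + step, cap), starting at 1.0."""
    return min(1.0 + step * steps, cap)


def frames_to_cap(step: float = 0.0012, cap: float = 1.12) -> int:
    """Number of per-frame steps before the cap is reached."""
    return math.ceil((cap - 1.0) / step - 1e-9)  # epsilon guards against float noise


FPS = 30  # assumed frame rate
n = frames_to_cap()
print(f"cap reached after {n} frames, about {n / FPS:.1f} s at {FPS} fps")
```

With these numbers the zoom runs for roughly the first hundred frames and then holds, so on a longer scene a smaller step (or a higher cap) keeps the image moving for the whole shot.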

Parallax, the one effect ffmpeg can't do alone

Parallax — hold the subject still, drift the background behind it for depth — is the exception to "an effect is a string." ffmpeg can composite layers but it can't find a subject, so this one needs a small, very indie-dev hack first: rembg cuts the subject (the static foreground), Python builds a clean background plane, and only then does ffmpeg drift the back and overlay the front.

The "clean background" is the whole problem. The naive version drifts the original still behind the cutout — but that still already contains the subject, so you get a creepy ghost twin smearing across the back. The fix is to give ffmpeg a background that's complete behind the subject, two ways:

Inpaint it out of the same image (default) — a free blur-diffusion fill: repeatedly blur, then re-stamp the known pixels so the subject's hole heals with its surroundings.
Generate a separate plate — re-prompt the scene without the subject (--parallax-plates, +1 still). Cleaner, no inpaint guesswork.

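# studio/animate.py — _inpaint_subject() (heal the subject's hole)
from PIL import Image, ImageFilter

# Image.composite takes image1 where the mask is white and image2 where it is black,
# so the mask passed here has to be white over the known pixels and black over the hole.
for _ in range(iters):
    blurred = bg.filter(ImageFilter.GaussianBlur(radius))
    bg = Image.composite(bg, blurred, subject_mask)  # keep outside, heal inside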
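To see why the loop converges, here is a dependency-free toy version of the same blur-then-re-stamp idea on a tiny grid of numbers. It swaps the Gaussian blur for a 3×3 box blur and is only meant to show the mechanism; all names are mine.

```python
def box_blur(grid):
    """3x3 box blur with edge clamping on a 2D list of floats."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += grid[yy][xx]
            out[y][x] = total / 9.0
    return out


def heal(grid, hole, iters=30):
    """Blur-diffusion fill: blur everything, then re-stamp the known pixels."""
    original = [row[:] for row in grid]
    cur = [row[:] for row in grid]
    for _ in range(iters):
        blurred = box_blur(cur)
        cur = [[blurred[y][x] if hole[y][x] else original[y][x]
                for x in range(len(grid[0]))]
               for y in range(len(grid))]
    return cur


# 5x5 flat background (value 10) with a one-pixel "subject" hole in the middle.
img = [[10.0] * 5 for _ in range(5)]
img[2][2] = 0.0
hole = [[(x, y) == (2, 2) for x in range(5)] for y in range(5)]
healed = heal(img, hole)
print(round(healed[2][2], 3))
```

Each pass pulls the hole a little closer to its neighbours while the known pixels are reset to their original values, which is why the fill blends in instead of smearing the whole image.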

There's also a cheaper third option that embraces the twin: blur the drifting plane hard so the duplicate melts into soft bokeh (blurred-parallax) — on busy backgrounds it reads as dreamy depth-of-field rather than a brittle cutout. A bug turned into a second legitimate look.
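Once a background plane exists, the ffmpeg half of parallax is a two-layer composite. The sketch below is one way to write it, under stated assumptions: input 0 is the background plate with the same aspect ratio as the output, input 1 is the subject cutout (a PNG with alpha, already at output size), and the 10% overscan and left-to-right drift are my choices rather than the repo's.

```python
def parallax_graph(w: int, h: int, seconds: float, blur_sigma: float = 0.0,
                   overscan: float = 1.1) -> str:
    """Drift an over-sized background behind a static foreground cutout."""
    big_w = int(w * overscan) // 2 * 2  # keep the scaled width even
    # Optional heavy blur turns any leftover "twin" into soft bokeh (the blurred-parallax look).
    blur = f"gblur=sigma={blur_sigma}," if blur_sigma > 0 else ""
    back = (
        f"[0:v]scale={big_w}:-2,{blur}"
        # crop evaluates x per frame, so the window slides across the spare width over time
        f"crop={w}:{h}:x='(iw-ow)*min(t/{seconds},1)':y='(ih-oh)/2'[bg]"
    )
    front = "[bg][1:v]overlay=0:0:format=auto[v]"  # subject stays pinned in place
    return f"{back};{front}"


print(parallax_graph(1080, 1920, 4))
print(parallax_graph(1080, 1920, 4, blur_sigma=12))
```

Both inputs would need to be looped stills (for example with -loop 1) so that t advances; the only difference between the two looks is the optional gblur stage.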

Text, and the font library that wasn't there

Kinetic type slides a headline in over a gently pulsing still. The text is rendered by Pillow into a transparent PNG and overlaid with an animated y so it rises into place:

# headline rises and settles over the first 0.6s
over = "[bg][t]overlay=x=(W-w)/2:y='H*0.18 - 50*min(t/0.6,1)':format=auto[v]"
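The y expression is easier to read once evaluated. Here is a small Python mirror of it (the function and parameter names are mine, and the 1920-pixel frame height is an assumed vertical Short):

```python
def headline_y(frame_h: int, t: float, top: float = 0.18, rise_px: float = 50.0,
               settle_s: float = 0.6) -> float:
    """Python mirror of the overlay expression y = H*0.18 - 50*min(t/0.6, 1)."""
    return frame_h * top - rise_px * min(t / settle_s, 1.0)


for t in (0.0, 0.3, 0.6, 2.0):
    print(f"t={t:.1f}s  y={headline_y(1920, t):.1f}")
```

On a 1920-pixel-tall frame the headline starts at about y = 345.6, climbs 50 pixels over the first 0.6 seconds, and then holds at about 295.6 because min() clamps the progress at 1.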

Why Pillow and not ffmpeg's drawtext? Because the box this renders on has an ffmpeg built without libfreetype and without libass — so drawtext and subtitles= both simply fail. Rather than fight the build, I render all text — headlines and burned caption strips alike — as Pillow PNGs and overlay them. The constraint forced a more portable design that happens to give pixel-perfect typographic control.

Choosing the effect: the model proposes, code constrains

A library this size is worthless if every scene defaults to Ken-Burns — which is exactly where this started. So a small art-direction layer (studio/artdirect.py) decides, with a deliberately hybrid policy:

The script model proposes a per-scene animator / atmosphere / fx / transition, choosing from a documented menu in its prompt, so the picks match the scene's mood — a duel gets embers and a red flash; a memory gets oldfilm; a landscape gets parallax.
A deterministic pass then constrains it: it validates the names, fills anything the model skipped with position and keyword heuristics (hook → kinetic, scenery → parallax), and applies taste caps — a flash is an impact, so it survives on at most one scene; a single atmosphere can't blanket the whole video.
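A minimal sketch of that deterministic pass, to show the shape of the idea. The scene fields, the "landscape" keyword check, and the exact name sets are assumptions for illustration; the real rules live in studio/artdirect.py and are not reproduced here.

```python
ANIMATORS = {"kenburns", "kinetic", "parallax", "puppet", "talkinghead", "manim"}
FLASHES = {"flash", "flash-white", "flash-yellow", "flash-red", "flash-black"}


def constrain(scenes: list[dict], max_flashes: int = 1) -> list[dict]:
    """Validate the model's picks, fill gaps with heuristics, apply taste caps."""
    out, flashes = [], 0
    for i, scene in enumerate(scenes):
        s = dict(scene)  # never mutate the model's proposal
        text = (s.get("text") or "").lower()
        if s.get("animator") not in ANIMATORS:  # unknown or missing name
            if i == 0:
                s["animator"] = "kinetic"       # hook -> kinetic
            elif "landscape" in text:
                s["animator"] = "parallax"      # scenery -> parallax
            else:
                s["animator"] = "kenburns"
        if s.get("fx") in FLASHES:
            flashes += 1
            if flashes > max_flashes:           # a flash is an impact, so cap it
                s["fx"] = None
        out.append(s)
    return out


proposal = [
    {"text": "Did you know?", "animator": "sparkle"},  # invalid name from the model
    {"text": "A wide landscape at dawn"},              # model skipped the animator
    {"text": "The duel", "animator": "kenburns", "fx": "flash-red"},
    {"text": "The echo", "animator": "kenburns", "fx": "flash-white"},
]
for scene in constrain(proposal):
    print(scene)
```

Because the pass is pure and deterministic, the same proposal always yields the same plan, which is what makes the model's taste usable inside an automated loop.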

"Model proposes, code constrains" recurs throughout this project; it's a good default whenever you want a model's judgement without its inconsistency. And because the same pass runs on the keyless stub path, every video gets real art direction instead of a wall of identical pans.

One concrete payoff: cheap punctuation for violence without gore (which also keeps the image model's content filter happy). A red flash on the cut plus a blood overlay, a few frames total — the viewer's mind fills in the rest, the narration carries the meaning, and it costs nothing.

The one I haven't cracked: manim

Manim, the engine behind 3Blue1Brown, is the most promising tool here and the least solved. True vector animation — a circle morphing into a square, a graph plotting itself, an equation transforming term by term — is close to a cheat code for an educational channel, rendered crisp for $0. A scene can carry a manim_code field the model writes, and the pipeline renders it.

The catch is getting a model to author good, literal, compiling manim on demand. It reaches for abstract moving lines when what sells is the literal shape; the code is indentation-sensitive; and a meaningful fraction of generated scenes fail and fall back to Ken-Burns. For now it's hand-authored for hero beats, not trusted to the loop — the single biggest unlock left for the educational side, and squarely on the roadmap. If you've cracked LLM→manim, I genuinely want to hear it.

And the ears

Visual motion is only half of "not slop"; a silent Short feels dead. So there's a matching audio layer — AI-generated sound effects plus a music bed ducked under the narration via sidechain compression (the voice always wins; the bed sits at −24 dB). On one Short that entire layer cost $0.0076. The "make it feel produced" budget, picture and sound together, rounds to zero.
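For the ducking itself, ffmpeg's sidechaincompress filter does the work: the narration is split, one copy is mixed and the other drives a compressor on the music. The sketch below is a guess at the shape, not the repo's graph. It treats −24 dB as the bed's resting level, and the threshold, ratio, attack, and release values are placeholders to tune by ear.

```python
def duck_graph(bed_db: float = -24.0, threshold: float = 0.05, ratio: float = 8.0,
               attack_ms: float = 20.0, release_ms: float = 300.0) -> str:
    """Input 0 = narration, input 1 = music bed. Output pad [a] is the final mix."""
    return ";".join([
        "[0:a]asplit=2[voice][key]",     # one copy to mix, one to drive the compressor
        f"[1:a]volume={bed_db}dB[bed]",  # set the bed's resting level
        f"[bed][key]sidechaincompress=threshold={threshold}:ratio={ratio}"
        f":attack={attack_ms}:release={release_ms}[ducked]",  # bed dips whenever the voice is present
        "[voice][ducked]amix=inputs=2:duration=first[a]",     # mix ends with the narration
    ])


print(duck_graph())
```

The string goes to -filter_complex with the narration as the first input and the music as the second, then -map "[a]" selects the mix.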

The road not taken: self-hosting the video model

There's a tempting middle path I should address, because every engineer asks it: the video models are open-weight now — why not run one locally and get real AI video for free too? I have a MacBook M4 with 36 GB of unified memory, so I wired a local ComfyUI + Wan 2.2 5B backend into the pipeline as a local-i2v provider and found out. Short version: it works, it's free, and it's a draft-tier toy you should keep out of your render path.

The log, honestly:

fp8 weights are broken on Apple's MPS backend — they load and produce NaN. So everything is GGUF-quantized (Wan 5B at Q4 ≈ 3.4 GB, plus a ~3.6 GB text encoder).
The full-precision version (~22 GB resident) plus the video VAE-decode spike blew past physical RAM, and because MPS has no real offload, macOS swapped and hung the whole machine — not the process, the OS. The fix is a PyTorch MPS watermark cap so a runaway allocation kills the process cleanly instead.
Even stable, it's slow: a 2-second clip took about 15 minutes, and per-step time balloons once memory pressure starts evicting.
And it improvises. On the Persian-miniature still below, Wan added genuine motion — then warped the ornate border and invented a hooded figure that wasn't in the source.
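The watermark cap from the second item is set through PyTorch's MPS environment variables, which have to be in place before the process that imports torch starts. A minimal sketch follows; the ratios are assumed values (the article does not state the ones used), and the helper name is mine. The low watermark is lowered along with the high one because PyTorch expects it not to exceed the high watermark.

```python
import os
from typing import Optional


def mps_capped_env(high_ratio: float = 0.8, low_ratio: float = 0.6,
                   base: Optional[dict] = None) -> dict:
    """Copy an environment and cap PyTorch's MPS allocator.

    PYTORCH_MPS_HIGH_WATERMARK_RATIO is a hard limit, expressed as a fraction of the
    device's recommended working-set size; allocations beyond it raise an
    out-of-memory error in the process instead of pushing the OS into swap.
    """
    env = dict(os.environ if base is None else base)
    env["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = str(high_ratio)
    env["PYTORCH_MPS_LOW_WATERMARK_RATIO"] = str(min(low_ratio, high_ratio))
    return env


env = mps_capped_env()
print(env["PYTORCH_MPS_HIGH_WATERMARK_RATIO"], env["PYTORCH_MPS_LOW_WATERMARK_RATIO"])
# The resulting dict would be passed as env= when launching the local backend
# with subprocess, so the cap applies only to that process.
```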

Set that against the hosted option — kling renders a 6-second hero clip in under a minute for about 42 cents — and "free" local generation costs you 15+ minutes, a fragile machine, and a worse result. Free isn't free when it's measured in wall-clock. So the verdict loops right back to this article's thesis: free ffmpeg motion for the overwhelming majority of scenes, a few cents of hosted video for the rare hero shot, and if you must run local, cap it to 1–2 seconds of motion on one or two scenes and Ken-Burns the rest. It stays in the repo as a draft-tier provider — glad I tried it, glad I didn't ship it.

That last pattern is exactly how I built this 55-second Rubaiyat reel: two of its four scenes got ~2 seconds of local Wan motion (the last frame is then held for the rest of the line), and the other two are pure Ken-Burns. Total video-generation cost: $0. It's the honest sweet spot for local i2v on a Mac: a brief breath of real generated motion where it counts, free camera motion everywhere else.
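Holding the last frame is itself one ffmpeg filter, tpad with stop_mode=clone. A small sketch of how the hold could be computed; the helper is hypothetical, and the 13.75-second line length is just 55 seconds divided evenly across four scenes for illustration.

```python
def hold_last_frame(clip_seconds: float, line_seconds: float) -> str:
    """ffmpeg video filter that pads a short clip by cloning its final frame."""
    extra = max(line_seconds - clip_seconds, 0.0)
    if extra == 0.0:
        return "null"  # ffmpeg's pass-through video filter: nothing to pad
    return f"tpad=stop_mode=clone:stop_duration={extra:.3f}"


# ~2 s of generated motion stretched to cover an assumed 13.75 s narration line
print(hold_last_frame(2, 13.75))
```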

What I'd tell another AI engineer

Before paying a generative model, ask what the viewer actually needs — usually the perception of motion and intention, not literally generated video. A zoompan expression, a parallax composite, a grain overlay, and a ducked music bed deliver that for nothing, and the indie-game-dev instinct (fake the expensive thing with cheap math) ports directly to AI media. Route every effect through one module, give each a graceful fallback, and the pipeline gets cheaper and sturdier at once.

Next — Part 4: The Cost Collapse, $10 → $0.06. With motion free, the full cost model: per-second video math, right-sizing the image model (Nano Banana vs Flux Schnell), the tier system, the auto strategy that spends only on hero scenes, and the --max-cost pre-flight that refuses to overspend.

▶ Live effects gallery: dasein108.github.io/slope-studio
⭐ Star the repo: github.com/dasein108/slope-studio
🔔 Subscribe to watch the experiment grow from zero: the channel
