Maksims Gavrilovs

Zero to Autopilot, Part 3: Giving a Still Image Real Motion for $0.00

Series: Zero to Autopilot — Building a Self-Improving AI Media Channel. Part 3 of 7. Part 1 was the landscape and my $10 wake-up call; Part 2 was the 7-stage pipeline. This one is the engineering centerpiece: replacing paid AI video with free motion.

Data status: real-now — real ffmpeg filtergraphs from the repo. Every effect here is playing in the live gallery (dasein108.github.io/slope-studio); code is open source.

Six free looks on one generated still — grain, vignette, chroma, glitch, sunrise, and a colour-graded variant. None of them cost a cent.

Viewers don't need generated video. They need motion.

The recap from Part 1 is one line of arithmetic: hosted AI image-to-video bills per second, so Kling at $0.07/s makes a 150-second Short cost about $10.50. Fine for a single hero shot; absurd as the default for every scene when the whole strategy depends on making hundreds of cheap experiments.
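
That arithmetic, as a minimal sketch. The $0.07/s rate and the 150-second length come from the text above; the batch of 100 Shorts is only an illustrative stand-in for "hundreds of cheap experiments", and `video_cost` is a hypothetical helper, not something from the repo.

```python
def video_cost(seconds: float, rate_per_second: float) -> float:
    """Dollar cost of generating `seconds` of video at a per-second rate."""
    return seconds * rate_per_second


short_cost = video_cost(150, 0.07)
print(f"one 150 s Short: ${short_cost:.2f}")      # one 150 s Short: $10.50
print(f"100 Shorts:      ${100 * short_cost:,.2f}")  # 100 Shorts:      $1,050.00
```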

But viewers were never asking for generated video. They want the feeling of motion: a still that drifts, breathes, and cuts on the beat holds attention perfectly well. I'd internalized this years ago shipping indie games, where the entire craft is faking expensive things with cheap math — no budget for a particle artist, so you write a particle system; no budget for animation, so you parallax-scroll a few layers and call it atmosphere. The same instinct ports straight to AI media. Everything below is one still image, ffmpeg, and zero dollars.

ffmpeg is the whole trick: an effect is a string

The quiet hero here is ffmpeg. It ships with roughly 400 built-in filters, and an "effect" is just a few of them chained with commas — no render engine, no GPU shaders, no SDK, no per-call cost. One binary you already have. Every motion in this series is an ffmpeg filtergraph, which means adding an effect is adding a string.

Here is the entire implementation of oldfilm, the vintage look:

"[0:v]colorchannelmixer=.393:.769:.189:0:.349:.686:.168:0:.272:.534:.131,"  # → sepia
"eq=contrast=1.12:saturation=0.82:brightness='0.035*sin(27*t)+0.025*sin(11*t)':eval=frame,"  # flicker; eval=frame re-evaluates t on every frame (eq defaults to eval=init)
"noise=alls=22:allf=t,"     # film grain, re-rolled every frame
"vignette=PI/4[v]"          # darkened corners

Read it like a Unix pipe; each comma is "then":

  • colorchannelmixer: a 3×3 RGB matrix that maps the image to a sepia tone.
  • eq=…brightness='…sin(t)…': t is the frame's timestamp, so brightness wobbles over time, which gives the projector-gate flicker. Time expressions are what make an effect animate: sin(t) here, a creeping zoom in Ken-Burns next.
  • noise=allf=t: the t (temporal) flag re-randomizes the grain every frame, so it shimmers instead of sitting frozen.
  • vignette=PI/4: darken the corners.

Four stock filters, one string, and it moves. A glitch is rgbashift + noise; chromatic aberration is just rgbashift; rain is a particle layer composited with overlay. The reason this channel can afford hundreds of videos isn't a cheaper model — it's that the effect budget is a text editor and ffmpeg -filter_complex.
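
To see how little surrounds that string, here is a minimal sketch of turning one still into a clip with it. The function names, file names, duration, and frame rate are assumptions for illustration, not the repo's API. The filtergraph is the one quoted above, with eval=frame set on eq, since eq evaluates its expressions once at init by default and the flicker needs t re-evaluated per frame.

```python
import subprocess

OLDFILM = (
    "[0:v]colorchannelmixer=.393:.769:.189:0:.349:.686:.168:0:.272:.534:.131,"
    "eq=contrast=1.12:saturation=0.82:"
    "brightness='0.035*sin(27*t)+0.025*sin(11*t)':eval=frame,"
    "noise=alls=22:allf=t,"
    "vignette=PI/4[v]"
)


def oldfilm_command(image, dst, seconds=5.0, fps=30):
    """Build the ffmpeg argv that loops one still through the oldfilm graph."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-framerate", str(fps), "-i", image,  # still -> endless stream
        "-filter_complex", OLDFILM,
        "-map", "[v]",
        "-t", str(seconds),          # cut the endless stream to the scene length
        "-pix_fmt", "yuv420p",       # widely playable; needs even width and height
        dst,
    ]


def render_oldfilm(image, dst, seconds=5.0, fps=30):
    """Run ffmpeg; raises CalledProcessError if the render fails."""
    subprocess.run(oldfilm_command(image, dst, seconds, fps), check=True)
```

Passing the command as a list avoids shell quoting entirely, which matters once expressions contain quotes, commas, and parentheses.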

The effect families

That one binary buys a whole vocabulary. The catalog sorts into a handful of families, each answering a different question — what does this scene need?

  • Camera motion: kenburns, motion-drift{left,right,up,down}, motion-zoom{in,out}, pulse. The cheapest possible life: a still pans, drifts, breathes. The default for most scenes.
  • Depth: parallax, blurred-parallax. Real 2.5D: the foreground subject holds still while the background drifts behind it. For scenery with a clear subject.
  • Kinetic type: kinetic. Emphasis: a headline slides in over the shot. For the hook or a key stat, not every scene.
  • Atmosphere: rain, snow, fog, embers, blood, petals, leaves, wind. Mood and a sense of place: the emotional weather, composited for free.
  • Colour & look grades: grain, vignette, oldfilm, sunrise, sunset, godrays, chroma. Tone and era. This family does the most to separate intentional from slop: grain and a vignette alone (the cover image is one still run through six of these) read as "graded by someone who cares."
  • Impact: flash[-white/-yellow/-red/-black], blood. A 2–3 frame punch for an action beat. Rare by design.
  • Characters: puppet (a cutout figure that hops or nods), talkinghead (Rhubarb lip-sync). A figure that acts or speaks, with no avatar model.
  • Vector: manim. Literal concept and maths visualization, 3Blue1Brown-style. The education power tool (and the one I haven't tamed; more below).
  • Transitions: cut, fade, dissolve, wipeleft, slideup, slice. Rhythm: how one scene becomes the next.

They're all the same idea underneath — a filtergraph string — so the rest of this piece takes apart the three most interesting ones.

How the motion is wired

Each scene names an animator, and one dispatch function routes to the implementation. The important property is the fallback, which this excerpt leaves out: anything that fails falls back to Ken-Burns and records why in the manifest, so a missing optional dependency degrades the look instead of breaking the render.

# studio/animate.py
a = (animator or "kenburns").strip()
if a == "kenburns" or a == "":   ffmpeg.ken_burns(image, dst, seconds)
elif a.startswith("motion-"):    ffmpeg.motion(image, dst, seconds, preset=a.split("-", 1)[1])
elif a == "kinetic":             return _kinetic(scene, image, dst, seconds)
elif a == "parallax":            return _parallax(scene, image, dst, seconds)
elif a == "slice":               return _slice(scene, image, dst, seconds)
elif a == "puppet":              return _puppet(scene, image, dst, seconds)
elif a == "talkinghead":         return _talkinghead(scene, image, dst, seconds, audio)
elif a == "manim":               return _manim(scene, dst, seconds)
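
The fallback itself is not in the excerpt, so here is a minimal sketch of the idea under stated assumptions: `animate_with_fallback`, its callables, and the manifest keys are hypothetical names, not the repo's code.

```python
from typing import Callable, Dict, Optional


def animate_with_fallback(
    animator: Optional[str],
    render: Callable[[str], None],
    ken_burns: Callable[[], None],
    manifest: Dict[str, str],
) -> Dict[str, str]:
    """Try the requested animator; on any failure render Ken-Burns and log why."""
    name = (animator or "kenburns").strip() or "kenburns"
    try:
        render(name)
        manifest["animator"] = name
    except Exception as exc:  # missing optional dependency, bad input, ...
        ken_burns()
        manifest["animator"] = "kenburns"
        manifest["fallback_from"] = name
        manifest["fallback_reason"] = f"{type(exc).__name__}: {exc}"
    return manifest


# Usage: a renderer whose optional dependency is missing.
def broken_render(name: str) -> None:
    raise ImportError("rembg is not installed")


fallback_calls = []
result = animate_with_fallback(
    "parallax", broken_render, lambda: fallback_calls.append("kenburns"), {}
)
print(result["animator"], "|", result["fallback_reason"])
# kenburns | ImportError: rembg is not installed
```

Catching broadly is deliberate in this sketch: the goal is that a scene always renders, and the manifest keeps the evidence for later debugging.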

The workhorse, Ken-Burns, is a single zoompan expression — over-scale the source 2× first so the crop never reaches an edge:

# studio/ffmpeg.py — ken_burns()
vf = (f"crop={w*2}:{h*2},"
      f"zoompan=z='min(zoom+0.0012,1.12)':d={frames}:s={w}x{h}:fps={fps}:"
      f"x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'")

z='min(zoom+0.0012,1.12)' creeps the zoom in a hair per frame, capped at 1.12×. The motion-* presets are the same machine with different z/x/y expressions — a whole family of movement from one filtergraph.
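
As an illustration of "same machine, different expressions", here is a sketch of a preset table that feeds one zoompan template. Only the zoomin row matches the expression quoted above; the other rows, the preset names, and `zoompan_stage` are assumptions of mine, not the repo's presets. It builds the zoompan stage only and uses zoompan's own variables (`zoom`, `on` for the output frame count, `duration` for the frames per input image).

```python
CENTER_X = "iw/2-(iw/zoom/2)"
CENTER_Y = "ih/2-(ih/zoom/2)"
SLACK_X = "(iw-iw/zoom)"  # how far the zoomed window can travel horizontally
SLACK_Y = "(ih-ih/zoom)"  # ... and vertically

# preset -> (z, x, y). Direction names describe where the window moves.
PRESETS = {
    "zoomin":  ("min(zoom+0.0012,1.12)", CENTER_X, CENTER_Y),
    "zoomout": ("max(1.12-0.0012*on,1)", CENTER_X, CENTER_Y),
    "right":   ("1.12", f"{SLACK_X}*on/duration", CENTER_Y),
    "left":    ("1.12", f"{SLACK_X}*(1-on/duration)", CENTER_Y),
    "down":    ("1.12", CENTER_X, f"{SLACK_Y}*on/duration"),
    "up":      ("1.12", CENTER_X, f"{SLACK_Y}*(1-on/duration)"),
}


def zoompan_stage(preset, w, h, frames, fps=30):
    """Return the zoompan filter string for one preset."""
    z, x, y = PRESETS[preset]
    return f"zoompan=z='{z}':d={frames}:s={w}x{h}:fps={fps}:x='{x}':y='{y}'"


print(zoompan_stage("left", 1080, 1920, frames=150))
```

Adding a movement is then one more row in the table, which is the property the filtergraph approach is buying.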

Parallax, the one effect ffmpeg can't do alone

Parallax — hold the subject still, drift the background behind it for depth — is the exception to "an effect is a string." ffmpeg can composite layers but it can't find a subject, so this one needs a small, very indie-dev hack first: rembg cuts the subject (the static foreground), Python builds a clean background plane, and only then does ffmpeg drift the back and overlay the front.

The "clean background" is the whole problem. The naive version drifts the original still behind the cutout — but that still already contains the subject, so you get a creepy ghost twin smearing across the back. The fix is to give ffmpeg a background that's complete behind the subject, two ways:

  • Inpaint it out of the same image (default) — a free blur-diffusion fill: repeatedly blur, then re-stamp the known pixels so the subject's hole heals with its surroundings.
  • Generate a separate plate — re-prompt the scene without the subject (--parallax-plates, +1 still). Cleaner, no inpaint guesswork.
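
# studio/animate.py: _inpaint_subject() (heal the subject's hole)
from PIL import Image, ImageFilter
# Image.composite keeps bg where the mask is white and takes blurred where it
# is black, so subject_mask must be white on the known pixels outside the subject.
for _ in range(iters):
    blurred = bg.filter(ImageFilter.GaussianBlur(radius))
    bg = Image.composite(bg, blurred, subject_mask)  # keep outside, heal inside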

The subject (left) is cut out and the background (right) healed, so the drifting plane has no ghost.

There's also a cheaper third option that embraces the twin: blur the drifting plane hard so the duplicate melts into soft bokeh (blurred-parallax) — on busy backgrounds it reads as dreamy depth-of-field rather than a brittle cutout. A bug turned into a second legitimate look.

Text, and the font library that wasn't there

Kinetic type slides a headline in over a gently pulsing still. The text is rendered by Pillow into a transparent PNG and overlaid with an animated y so it rises into place:

# headline rises and settles over the first 0.6s
over = "[bg][t]overlay=x=(W-w)/2:y='H*0.18 - 50*min(t/0.6,1)':format=auto[v]"

A kinetic headline over a pulsing still — the text is a Pillow PNG, animated in ffmpeg.
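
To make the y expression concrete, here is the same formula evaluated in Python. The 1920-pixel frame height is an assumption for the example, and `headline_y` is a hypothetical helper that mirrors `H*0.18 - 50*min(t/0.6,1)`.

```python
def headline_y(t, frame_height, rise_px=50.0, settle_s=0.6, anchor=0.18):
    """Top edge of the headline at time t (seconds), in pixels from the top."""
    progress = min(t / settle_s, 1.0)  # 0 -> 1 over the first settle_s seconds
    return frame_height * anchor - rise_px * progress


for t in (0.0, 0.3, 0.6, 2.0):
    print(f"t={t:.1f}s  y={headline_y(t, 1920):.1f}")
# t=0.0s  y=345.6
# t=0.3s  y=320.6
# t=0.6s  y=295.6
# t=2.0s  y=295.6
```

The min() clamp is what makes the headline settle: after 0.6 s the progress stays at 1 and y stops changing.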

Why Pillow and not ffmpeg's drawtext? The box this renders on has an ffmpeg built without libfreetype and without libass, so drawtext and subtitles= both simply fail. Rather than fight the build, I render all text, headlines and burned caption strips alike, as Pillow PNGs and overlay them. The constraint forced a more portable design that happens to give pixel-perfect typographic control.

Choosing the effect: the model proposes, code constrains

A library this size is worthless if every scene defaults to Ken-Burns — which is exactly where this started. So a small art-direction layer (studio/artdirect.py) decides, with a deliberately hybrid policy:

  • The script model proposes a per-scene animator / atmosphere / fx / transition, choosing from a documented menu in its prompt, so the picks match the scene's mood — a duel gets embers and a red flash; a memory gets oldfilm; a landscape gets parallax.
  • A deterministic pass then constrains it: it validates the names, fills anything the model skipped with position and keyword heuristics (hook → kinetic, scenery → parallax), and applies taste caps — a flash is an impact, so it survives on at most one scene; a single atmosphere can't blanket the whole video.
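
The deterministic half of that policy can be sketched in a few lines. Everything here is an assumption for illustration: the scene dict shape, the keyword list, the rule that the first flash wins, and the name `constrain` are mine, not the contents of studio/artdirect.py.

```python
ANIMATORS = {"kenburns", "kinetic", "parallax", "slice",
             "puppet", "talkinghead", "manim"}
SCENERY_WORDS = ("mountain", "forest", "sea", "city", "landscape")  # hypothetical


def constrain(scenes):
    """Validate the model's picks, fill gaps with heuristics, cap flashes at one."""
    fixed, flash_used = [], False
    for index, scene in enumerate(scenes):
        s = dict(scene)  # never mutate the model's proposal in place
        name = s.get("animator") or ""
        if name not in ANIMATORS and not name.startswith("motion-"):
            text = s.get("text", "").lower()
            if index == 0:
                s["animator"] = "kinetic"    # hook -> kinetic
            elif any(word in text for word in SCENERY_WORDS):
                s["animator"] = "parallax"   # scenery -> parallax
            else:
                s["animator"] = "kenburns"
        fx = s.get("fx") or ""
        if fx.startswith("flash"):
            if flash_used:
                s["fx"] = None               # taste cap: one flash per video
            flash_used = True
        fixed.append(s)
    return fixed


proposal = [
    {"text": "The duel that ended an empire", "animator": None, "fx": "flash-red"},
    {"text": "A forest at dawn", "animator": "warp-drive", "fx": "flash-white"},
    {"text": "He remembers the day", "animator": "kenburns"},
]
for scene in constrain(proposal):
    print(scene["animator"], scene.get("fx"))
# kinetic flash-red
# parallax None
# kenburns None
```

Because the pass is pure and deterministic, the same input always yields the same art direction, which keeps the model's variability out of the final render.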

"Model proposes, code constrains" recurs throughout this project; it's a good default whenever you want a model's judgement without its inconsistency. And because the same pass runs on the keyless stub path, every video gets real art direction instead of a wall of identical pans.

One concrete payoff: cheap punctuation for violence without gore (which also keeps the image model's content filter happy). A red flash on the cut plus a blood overlay, a few frames total — the viewer's mind fills in the rest, the narration carries the meaning, and it costs nothing.

The one I haven't cracked: manim

Manim, the engine behind 3Blue1Brown, is the most promising tool here and the least solved. True vector animation — a circle morphing into a square, a graph plotting itself, an equation transforming term by term — is close to a cheat code for an educational channel, rendered crisp for $0. A scene can carry a manim_code field the model writes, and the pipeline renders it.

Manim demos rendering as real vectors in the gallery — a rising sun, a morphing shape, a sine curve drawing itself, an orbit, a bar chart.

The catch is getting a model to author good, literal, compiling manim on demand. It reaches for abstract moving lines when what sells is the literal shape; the code is indentation-sensitive; and a meaningful fraction of generated scenes fail and fall back to Ken-Burns. For now it's hand-authored for hero beats, not trusted to the loop — the single biggest unlock left for the educational side, and squarely on the roadmap. If you've cracked LLM→manim, I genuinely want to hear it.

And the ears

Visual motion is only half of "not slop"; a silent Short feels dead. So there's a matching audio layer — AI-generated sound effects plus a music bed ducked under the narration via sidechain compression (the voice always wins; the bed sits at −24 dB). On one Short that entire layer cost $0.0076. The "make it feel produced" budget, picture and sound together, rounds to zero.
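
Ducking is also a filtergraph. The sketch below is one plausible wiring, not the repo's: input 0 is the narration, input 1 is the music bed, "sits at −24 dB" is read as a volume of -24dB applied before compression, and the threshold, ratio, attack, and release values are illustrative. `ducking_graph` is a hypothetical helper, and amix's normalize option assumes a reasonably recent ffmpeg.

```python
def ducking_graph(bed_db=-24.0):
    """Music bed ducked under narration with ffmpeg's sidechaincompress."""
    return (
        # the voice is needed twice: once in the mix, once as the sidechain key
        "[0:a]asplit=2[voice][key];"
        f"[1:a]volume={bed_db:g}dB[bed];"
        # whenever the key (voice) is loud, the bed is compressed downwards
        "[bed][key]sidechaincompress="
        "threshold=0.05:ratio=8:attack=20:release=300[ducked];"
        # normalize=0 stops amix from scaling both inputs down
        "[voice][ducked]amix=inputs=2:duration=first:normalize=0[a]"
    )


print(ducking_graph())
```

Mapping `[a]` alongside the video stream gives a mix in which the narration level is untouched and only the bed moves.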

What I'd tell another AI engineer

Before paying a generative model, ask what the viewer actually needs — usually the perception of motion and intention, not literally generated video. A zoompan expression, a parallax composite, a grain overlay, and a ducked music bed deliver that for nothing, and the indie-game-dev instinct (fake the expensive thing with cheap math) ports directly to AI media. Route every effect through one module, give each a graceful fallback, and the pipeline gets cheaper and sturdier at once.


Next — Part 4: The Cost Collapse, $10 → $0.06. With motion free, the full cost model: per-second video math, right-sizing the image model (Nano Banana vs Flux Schnell), the tier system, the auto strategy that spends only on hero scenes, and the --max-cost pre-flight that refuses to overspend.

Live effects gallery: dasein108.github.io/slope-studio
Star the repo: github.com/dasein108/slope-studio
🔔 Subscribe to watch the experiment grow from zero: the channel
