bykamo

Posted on • Originally published at bykamodev.Medium

I Made 8 Book Trailers for $9.98 (They're Not Perfect, and That's the Point)

I had eight short mystery novels and no budget for video. So I built a pipeline: turn each storyboard frame into a 5-second clip with PIKA's image-to-video, then do all the cutting, panning, titling and cross-fades locally with ffmpeg. PIKA gave me 1,500 trial credits. I burned through all of them, topped up twice at $4.99, and shipped eight 17-second trailers. Total cash out of pocket: $9.98. The ffmpeg half cost nothing.

Up front, so nobody feels misled: these trailers are not flawless. Look closely and you'll find a frame where a hand isn't quite right, or a background detail that shimmers. Every one of the eight has a rough spot or two. The point of this post isn't "AI makes perfect video now" — it's the opposite. AI makes flawed video, predictably, in specific places, and the whole game is knowing where it breaks and routing around those spots for $0 instead of burning credits trying to force perfection. That's how eight trailers cost ten dollars instead of a hundred.

See for yourself — here's one of the eight (Book 6, 17 seconds):

[https://youtube.com/shorts/EZp40yhB_rs?si=ADccBnDbspgNGdAQ]

Rough spots and all.

TL;DR

  • Eight book trailers, end-to-end AI pipeline: image → PIKA image-to-video → local ffmpeg.
  • The animation prompt is the whole trick: "extremely subtle motion only" — let steam, light and petals move, freeze everything else. The moment a character's hands move, PIKA melts them.
  • Real cost: 1,500 free trial credits + $9.98 in top-ups. ffmpeg = $0.
  • Audio: an original score I composed, added with the free OSS editor OpenCut — $0.
  • They're not perfect. Each trailer has 1–2 visible rough spots. The method is about containing that, not eliminating it.
  • The honest part is in What broke: melted fingers, a face that turned into a horror shot, and a crop setting in ffmpeg that changed the sample aspect ratio and stretched every tilt clip 2% wide until I caught it.

What this method does and doesn't do

Because the framing matters before you read a single command:

  • It does: produce watchable, atmospheric 9:16 trailers cheaply and repeatably, with every deterministic decision (timing, framing, text, transitions) under your control in ffmpeg.
  • It doesn't: make PIKA stop mangling hands, faces, and small props. That limitation is real and this pipeline does not fix it. What it does is give you a $0 escape hatch — animate the still instead of the motion — so a broken shot costs you nothing and never spirals into a re-roll bill.

If you want broadcast-perfect output, this isn't it. If you want eight decent trailers for the price of a sandwich, read on.

Architecture

  novel text (.md) + storyboard (.png)
            │
            ▼   [ image generation — done separately, Codex-driven,
   8 still scenes              one character sheet as the consistency anchor ]
   (941×1672 portrait
    or 1672×941 land.)
            │
            ▼   PIKA: upload_asset → R2 PUT → md5 verify
   generate_video
   (image_to_video, pika 2.2, 1080p, 5s,
    prompt = "extremely subtle motion only")
            │
            ▼   download each clip
   ┌─────────  LOCAL ffmpeg  ─────────┐
   │  Ken Burns (zoompan / crop)       │
   │  normalize SAR (setsar=1)         │
   │  concat 8 clips                   │
   │  burn captions (drawtext ×4)      │
   │  cross-fade to cover (xfade 0.8s) │
   └───────────────────────────────────┘
            │
            ▼
   final .mp4  (≈17–19s, 1080p30, silent)

The hinge is the split: PIKA only animates; it never composes. Every cut, pan, caption and transition is deterministic ffmpeg I can re-run for free. That's what keeps the variable cost near zero — I only pay PIKA for motion, and I pay it once per shot.

Note on the upstream image step (done in ChatGPT, outside the ffmpeg kit): I first define the cast on one character sheet — every recurring character in a single image, which becomes the consistency anchor. Then I ask for a vertical storyboard: eight numbered panels, each with a 9:16 frame, a timecode (0:00–0:02 …), and a scene / action / camera note. Then I have each panel rendered as a still. The storyboard is the spec for everything downstream — framing, order and duration are all decided there. This post picks up where the eight stills exist and need to move.

How it works

One book = 8 clips. PIKA returns 5-second 1080p clips; ffmpeg trims the first seven to 1.8s each and the final shot to 2.4s, giving a 15.0s body (7 × 1.8 + 2.4). A 2.8s cover clip then cross-fades in over the last 0.8s of the body, for about 17s total (15.0 + 2.8 - 0.8).
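The timing arithmetic can be checked with a few lines of shell. This is a minimal sketch: trailer_timing is a hypothetical helper name, not something from the kit, and the numbers are the ones quoted in this post (8 clips, 1.8s cuts, a 2.4s final shot, a 2.8s cover clip, a 0.8s cross-fade).

```shell
#!/bin/sh
# Hypothetical helper: prints body length, xfade offset and total runtime.
# $1 = number of clips, $2 = seconds per clip, $3 = seconds for the last clip,
# $4 = cover clip length, $5 = cross-fade duration
trailer_timing() {
  awk -v n="$1" -v cut="$2" -v last="$3" -v cover="$4" -v fade="$5" 'BEGIN {
    body   = (n - 1) * cut + last   # 7 x 1.8 + 2.4
    offset = body - fade            # where the cross-fade has to start
    total  = body + cover - fade    # the overlapped 0.8s is counted once
    printf "body=%.1f offset=%.1f total=%.1f\n", body, offset, total
  }'
}

trailer_timing 8 1.8 2.4 2.8 0.8
```

The offset it prints is the value passed to xfade in Step 5, so changing a cut length means recomputing that one number.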

Step 1 — Animate each still on PIKA

I drive PIKA through its MCP server, not the REST API directly — so the steps below (upload_asset → R2 PUT → md5 verify → generate_video) are MCP tool calls an agent makes on my behalf, which is why the whole thing is scriptable from a chat session. Each still goes in as one image with an image_to_video call (pika 2.2, 1080p).

The per-shot prompt isn't hand-written — it's generated one cut at a time by reading three things together: the storyboard panel (scene / action / camera / on-screen text), the actual still (looked at directly — who's wearing what, what's in frame, what's in their hands), and the story beat from the manuscript (where this moment sits emotionally). Out of that comes a four-part instruction that follows the same template every time:

[what's in the frame, read from the image]
+ [the ONLY things allowed to move: steam / light / petals / water]
+ [what must stay frozen: people still, hands stable, objects unchanging]
+ [one word of tone, from the story beat]

Here's a real one — Book 6, the shot where a note is tucked against a cup:

A held, still moment: a young woman in a yellow cardigan gently
tucks a small folded note against a paper cup among a row of
leaf-latte cups, Aoi watching quietly behind, the cat nearby.
Her hand and the note stay completely still and stable, the
paper unchanging. The ONLY motion is faint steam rising and a
soft flicker of warm lamp light.

Notice the balance: almost every word tells PIKA what not to move. That's the opposite of normal video prompting, and it's deliberate — PIKA's image-to-video gets worse the more motion you ask for, so the prompt's real job is to fence off everything fragile (hands, small props, faces) and leave only steam, light, petals and water free. The negative prompt does the same defensively: deformed hands, extra fingers, melting hand, morphing note, warping paper.
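The four-part template can be assembled mechanically once the parts are written. The sketch below is only an illustration: build_prompt and NEGATIVE_PROMPT are hypothetical names, and the sentence wording inside the function is an assumption modeled on the Book 6 example, not the exact generator used for the trailers.

```shell
#!/bin/sh
# Hypothetical helper: joins the four template parts into one prompt.
# $1 = what is in the frame, $2 = the only things allowed to move,
# $3 = what must stay frozen, $4 = one word of tone
build_prompt() {
  printf '%s %s stay completely still and stable. The ONLY motion is %s. Tone: %s.\n' \
    "$1" "$3" "$2" "$4"
}

# Negative prompt terms quoted from the post.
NEGATIVE_PROMPT="deformed hands, extra fingers, melting hand, morphing note, warping paper"

build_prompt \
  "A young woman tucks a folded note against a paper cup." \
  "faint steam rising and a soft flicker of warm lamp light" \
  "Her hand and the note" \
  "hushed"
```

Keeping the frozen list as its own argument makes it hard to forget: a prompt with an empty third part is a visible mistake rather than a silent one.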

One hard rule: camera moves never go to PIKA. Pans, tilts and zooms — the storyboard's CAMERA column — are added later in ffmpeg (zoompan/crop, Step 2). Ask PIKA to move the camera and it warps the whole composition, so PIKA only ever adds ambient micro-motion in place; every deliberate camera move is a deterministic ffmpeg pass. Same thesis as before — the generative model does the one thing it's good at, ffmpeg does everything that has to be repeatable.

Step 2 — Ken Burns each clip (portrait example)

zoompan gives a clean slow push and, importantly, comes out at SAR 1:1:

ffmpeg -y -i c2.mp4 -t 1.8 -filter_complex \
"[0:v]fps=30,scale=1380:2400,setsar=1,\
zoompan=z='min(zoom+0.0009,1.11)':d=1:\
x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':s=1080x1920:fps=30[v]" \
-map "[v]" -an -c:v libx264 -pix_fmt yuv420p -r 30 p2.mp4

Mind the quotes. The min(...) expression contains a comma, and inside -filter_complex an unquoted comma is read as a filter separator — zoompan=z=min(zoom+0.0009,1.11) fails with No option name near '1:x='. Single-quote every expression value (z='...', x='...', y='...') and the comma is safe. (I verified this whole sequence end-to-end on the real frames before publishing; the unquoted version genuinely does not run.)
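One way to keep the quoting from drifting between clips is to generate the zoompan string in one place. This is a sketch under assumptions: zoompan_filter is a hypothetical helper, and it only parametrizes the zoom step and the zoom ceiling from the command above.

```shell
#!/bin/sh
# Hypothetical helper: prints a zoompan filter with every expression
# single-quoted, so the comma inside min(...) is not read as a separator.
# $1 = zoom increment per frame, $2 = maximum zoom
zoompan_filter() {
  printf "zoompan=z='min(zoom+%s,%s)':d=1:x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':s=1080x1920:fps=30\n" "$1" "$2"
}

zoompan_filter 0.0009 1.11

# Usage (not executed here):
# ffmpeg -y -i c2.mp4 -t 1.8 -filter_complex \
#   "[0:v]fps=30,scale=1380:2400,setsar=1,$(zoompan_filter 0.0009 1.11)[v]" \
#   -map "[v]" -an -c:v libx264 -pix_fmt yuv420p -r 30 p2.mp4
```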

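For a tilt-up (river surface → café) I use crop with a time-driven y offset. One catch: the trailing :1 in the crop arguments is the filter's fifth positional option, keep_aspect=1, which keeps the input display aspect ratio by rewriting the sample aspect ratio. Cropping 1380×2400 down to 1080×1920 that way leaves SAR at 46:45 (see What broke #3). Dropping the :1, or appending setsar=1 right after crop in the same chain, should make a second pass unnecessary. The two commands below keep the :1 and re-normalize in a second pass: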

ffmpeg -y -i c1.mp4 -t 1.8 -filter_complex \
"[0:v]fps=30,scale=1380:2400,setsar=1[b];\
[b]crop=1080:1920:(in_w-1080)/2:(ih-1920)*(1-(t/1.8)):1[v]" \
-map "[v]" -an -c:v libx264 -pix_fmt yuv420p -r 30 p1.mp4
ffmpeg -y -i p1.mp4 -vf "setsar=1" -c:v libx264 -pix_fmt yuv420p -r 30 -an p1fix.mp4
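The 46:45 figure can be derived rather than discovered. With keep_aspect=1, crop keeps the input display aspect ratio, so the output SAR is the input frame's aspect divided by the cropped frame's aspect. The sketch below assumes the input SAR is 1:1 (which the preceding setsar=1 ensures); crop_sar is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: SAR left behind by crop with keep_aspect=1,
# assuming the input SAR is 1:1.
# $1 x $2 = frame size before crop, $3 x $4 = crop size
crop_sar() {
  awk -v iw="$1" -v ih="$2" -v ow="$3" -v oh="$4" '
    function gcd(a, b,   t) { while (b) { t = b; b = a % b; a = t } return a }
    BEGIN {
      num = iw * oh              # (iw/ih) / (ow/oh) = (iw*oh) / (ih*ow)
      den = ih * ow
      g = gcd(num, den)
      printf "%d:%d (%.1f%% wide)\n", num / g, den / g, (num / den - 1) * 100
    }'
}

crop_sar 1380 2400 1080 1920
```

For the tilt clip this reproduces the 46:45 ratio and the roughly 2% horizontal stretch described in What broke.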

Step 3 — Concat the 8 clips

# p1fix.mp4 is the SAR-normalized tilt clip from Step 2.
printf "file %s\n" p1fix.mp4 p2.mp4 p3.mp4 p4.mp4 p5.mp4 p6.mp4 p7.mp4 p8.mp4 > cl.txt
ffmpeg -y -f concat -safe 0 -i cl.txt -vf "setsar=1" \
-c:v libx264 -pix_fmt yuv420p -r 30 -an body.mp4
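Before titling, the sample aspect ratio of body.mp4 can be read back with ffprobe. This is a minimal sketch, not the verification script from the kit: sar_of and sar_ok are hypothetical helper names, and only an explicit 1:1 is treated as a pass.

```shell
#!/bin/sh
# Hypothetical helper: prints the SAR of the first video stream, e.g. "1:1".
sar_of() {
  ffprobe -v error -select_streams v:0 \
    -show_entries stream=sample_aspect_ratio -of csv=p=0 "$1"
}

# Hypothetical helper: succeeds only when the reported SAR is exactly 1:1.
sar_ok() {
  [ "$1" = "1:1" ]
}

# Usage (not executed here):
# sar_ok "$(sar_of body.mp4)" || echo "body.mp4 is not SAR 1:1; re-run with setsar=1" >&2
```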

Step 4 — Burn captions, one pass per line

Stacking multiple drawtext filters in one -vf string kept blowing up in the shell. So each caption is its own pass, daisy-chained — pass 1 reads body.mp4 and writes s1.mp4, pass 2 reads s1.mp4, and so on, four captions = four passes. Each caption's alpha fade window is set to the duration of the cut it sits under, with a hand-rolled fade-in/hold/fade-out:

FONT="/usr/share/fonts/truetype/dejavu/DejaVuSerif.ttf"
# Pass 2 of 4: reads the output of pass 1 (s1.mp4) and writes s2.mp4.
ffmpeg -y -i s1.mp4 -vf \
"drawtext=fontfile=$FONT:text=Some words travel quietly.:\
fontcolor=white:fontsize=46:x=(w-text_w)/2:y=h-460:\
shadowcolor=black@0.8:shadowx=2:shadowy=2:\
alpha='if(lt(t,3.7),0,if(lt(t,4.3),(t-3.7)/0.6,if(lt(t,5.2),1,if(lt(t,5.7),(5.7-t)/0.5,0))))'" \
-c:v libx264 -pix_fmt yuv420p -r 30 -an s2.mp4
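The nested if() expression and the pass-to-pass file names both follow a fixed pattern, so they can be generated. The sketch below is an assumption about how one might script it: caption_alpha and caption_io are hypothetical helper names, and the four times are the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper: builds the fade-in / hold / fade-out alpha expression.
# $1 = fade-in start, $2 = fully visible, $3 = fade-out start, $4 = fully gone
caption_alpha() {
  awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" 'BEGIN {
    printf "if(lt(t,%s),0,if(lt(t,%s),(t-%s)/%g,if(lt(t,%s),1,if(lt(t,%s),(%s-t)/%g,0))))\n",
           a, b, a, b - a, c, d, d, d - c
  }'
}

# Hypothetical helper: input and output file for caption pass N (1-based).
# Pass 1 reads body.mp4; every later pass reads the previous pass's output.
caption_io() {
  if [ "$1" -eq 1 ]; then
    echo "body.mp4 s1.mp4"
  else
    echo "s$(( $1 - 1 )).mp4 s$1.mp4"
  fi
}

caption_alpha 3.7 4.3 5.2 5.7
caption_io 2
```

The expression still has to be wrapped in single quotes (alpha='...') when it goes into the drawtext filter.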

Two portability gotchas here. First, drawtext only exists if your ffmpeg was built with --enable-libfreetype (some minimal builds ship without it; check with ffmpeg -filters | grep drawtext). Second, the fontfile path is OS-specific: the DejaVu path above is Linux, and on macOS you would point it at something like /System/Library/Fonts/Supplemental/Georgia.ttf. The alpha='...' value is single-quoted for the same reason as the zoompan expressions: those commas would otherwise split the filter.
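Both checks are easy to script. This is a sketch under assumptions: font_for_os and has_drawtext are hypothetical helper names, the two font paths are the ones named above, and any other platform is left for you to fill in.

```shell
#!/bin/sh
# Hypothetical helper: maps a `uname -s` value to a serif font path.
# Prints an empty line for platforms not covered in this post.
font_for_os() {
  case "$1" in
    Linux)  echo "/usr/share/fonts/truetype/dejavu/DejaVuSerif.ttf" ;;
    Darwin) echo "/System/Library/Fonts/Supplemental/Georgia.ttf" ;;
    *)      echo "" ;;
  esac
}

# Hypothetical helper: succeeds when this ffmpeg build lists drawtext.
has_drawtext() {
  ffmpeg -hide_banner -filters 2>/dev/null | grep -q drawtext
}

FONT=$(font_for_os "$(uname -s)")

# Usage (not executed here):
# has_drawtext || echo "this ffmpeg build has no drawtext filter" >&2
# [ -f "$FONT" ] || echo "font not found: $FONT" >&2
```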

Step 5 — Cross-fade into the cover

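A portrait cover (1600×2560) is height-fit and center-cropped to fill 9:16 with no black bars, then xfaded over the last 0.8s of the captioned body. Here body_text.mp4 stands for the output of the final caption pass, and offset=14.2 is the 15.0s body minus the 0.8s fade: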

ffmpeg -y -loop 1 -i cover_v.png -t 2.8 -filter_complex \
"scale=-1:1920,setsar=1,crop=1080:1920:(in_w-1080)/2:0,\
zoompan=z='min(zoom+0.0003,1.03)':d=1:x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':s=1080x1920:fps=30" \
-c:v libx264 -pix_fmt yuv420p -r 30 -an cover_clip.mp4

ffmpeg -y -i body_text.mp4 -i cover_clip.mp4 -filter_complex \
"[0:v]settb=AVTB,fps=30,format=yuv420p,setsar=1[a];\
[1:v]settb=AVTB,fps=30,format=yuv420p,setsar=1[b];\
[a][b]xfade=transition=fade:duration=0.8:offset=14.2[v]" \
-map "[v]" -c:v libx264 -pix_fmt yuv420p -r 30 -an final.mp4
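The scale and crop numbers in the cover command follow from the cover's dimensions. The sketch below works them out; cover_fit is a hypothetical helper name, and it assumes the scaled cover is at least as wide as the target frame (otherwise the crop would fail).

```shell
#!/bin/sh
# Hypothetical helper: width after scale=-1:TARGET_H and the x offset
# of a centered crop.
# $1 x $2 = cover size, $3 x $4 = target frame
cover_fit() {
  awk -v w="$1" -v h="$2" -v tw="$3" -v th="$4" 'BEGIN {
    sw = int(w * th / h + 0.5)   # width once the height is fit to the frame
    x  = int((sw - tw) / 2)      # left edge of the centered crop
    printf "scaled=%dx%d crop_x=%d\n", sw, th, x
  }'
}

cover_fit 1600 2560 1080 1920
```

For the 1600×2560 cover, 60 pixels come off each side, which is why the title text needs a visual check on the final frame.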

What it cost

  • PIKA image-to-video: 1,500 free trial credits, fully consumed, then 2 × $4.99 top-ups = $9.98.
  • ffmpeg (all editing): $0 — runs locally, no per-render charge.
  • Image generation (upstream): ran on a separate Codex-based workflow; not billed into this number.
  • Music: $0 cash — every trailer carries an original score I composed myself. No licensing, no library fees. ffmpeg locks the silent video; then I drop the track on in OpenCut, a free open-source editor, and export. So the whole audio layer is $0 too: free tool, own music.

Total cash out of pocket: $9.98 for eight finished trailers. I won't pretend the trial credits were worth literally nothing, but the only money that left my account was the two top-ups.

(One honest gap: PIKA's API response doesn't return a per-job credit charge, so I can't give you a clean cost-per-clip. What I can count is generations: roughly 8–9 video calls per book. A re-rolled shot adds a call, while a shot rescued through the still-image route is not animated at all and adds none.)

What broke

This is the part that actually transfers. None of these got fully "solved" — they got contained. That distinction is the whole method.

  1. Moving hands and props melt. The single most frequent failure. Handing over a coin, lifting a spoon to the mouth, slipping a folded note into a cup — fingers morph and the object duplicates or smears. A Gemini pass on the output flagged it severe ("fingers fuse, the note becomes a warped block"). Fix: for that one shot, don't animate it at all — drive the original still through ffmpeg Ken Burns (zoom only). The hand never moves, so it can't break, and it costs zero extra credits. That's the trick that kept the bill at $9.98 instead of a re-roll spiral.

  2. A character eating while facing the camera turns into a horror shot. One book had an elderly woman lifting a spoon while facing the camera; the animated version was genuinely unsettling. Swapped the source frame to a profile and it resolved. Lesson: for an emotional close-up, a side angle is safer than a front angle.

  3. crop silently sets SAR to 46:45 and stretches everything 2% wide. The worst bug because it's invisible until you compare side by side. zoompan clips come out 1:1; crop-based tilt clips don't. The cause is the trailing :1 in my crop arguments: it is keep_aspect=1, which keeps the input display aspect ratio by rewriting SAR, and (1380/2400) / (1080/1920) = 46/45. Fix: force setsar=1 on every clip and verify the concatenated result with ffprobe (sample_aspect_ratio=1:1) before titling. Never trust the concat to inherit it.

  4. Multiple drawtext filters in one -vf string die in the shell (returncode 234) — the commas and quoting collide. Fix: one caption per ffmpeg pass (four captions = four passes). And any comma inside the text has to be escaped \\, or it's read as a filter separator.

  5. A landscape title card in a 9:16 frame = black bars. Fixed by using the portrait cover, scale=-1:1920 then center-crop, so it fills the frame. Rule I added: always extract the final frame and eyeball it — captions and cover text get clipped in ways the timeline doesn't show.

  6. What I couldn't contain. Being straight about the leftover rough spots: a couple of shots still have a slightly-off hand or a background element that shimmers under motion, even after routing the worst offenders to the still-image escape hatch. At trailer speed (1.8s per shot, lots of cuts) most viewers don't catch them, but they're there. I decided eight watchable trailers today beat zero perfect ones next month. Your call on where that line sits.
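The still-image escape hatch from item 1 looks roughly like this. It is a sketch, not the kit's script: still_filter and still_push are hypothetical helper names, and the filter simply reuses the scale and zoompan settings from Step 2 with the -loop 1 input pattern from the cover command in Step 5.

```shell
#!/bin/sh
# Hypothetical helper: filter chain for a slow push on a still image
# (zoom only, same settings as the Step 2 Ken Burns pass).
still_filter() {
  printf "scale=1380:2400,setsar=1,zoompan=z='min(zoom+0.0009,1.11)':d=1:x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':s=1080x1920:fps=30\n"
}

# Hypothetical helper: renders the still as a clip without calling PIKA.
# $1 = source still, $2 = output clip, $3 = duration in seconds
still_push() {
  ffmpeg -y -loop 1 -i "$1" -t "$3" -filter_complex "$(still_filter)" \
    -c:v libx264 -pix_fmt yuv420p -r 30 -an "$2"
}

# Usage (not executed here):
# still_push scene4.png p4.mp4 1.8
```

Because the output has the same size, frame rate and pixel format as the animated clips, it drops into the Step 3 concat list unchanged.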

Why this matters

There's a lot of "I made a video with AI" content, and almost none of it tells you where the seams are. The seam here is specific and reusable: let the generative model do only the thing it's good at — ambient motion — and move every deterministic decision (timing, framing, text, transitions) into ffmpeg, where it's free and repeatable. That division is why eight trailers cost ten dollars instead of a hundred, and why the failures were containable: when a shot broke, I had a $0 escape hatch (animate the still instead) sitting right there.

It's also just the texture of building things in Kyoto on a shoestring — eight little mystery novels set in a riverside café, given motion by a trial account and a text editor full of ffmpeg one-liners.


Want the scripts?

I packaged the whole thing — the parametrized make_trailer.sh (portrait + landscape), the PIKA shot-by-shot prompt recipes, the SAR-verification script, and a README that lays out which shots to animate vs. route to the $0 still-image escape hatch — into a kit. It assumes the same honest premise as this post: it won't make PIKA flawless, it'll help you ship watchable trailers cheaply and skip the failures I already paid for.

→ ffmpeg Book Trailer Kit on Ko-fi: https://ko-fi.com/s/a637e3c118


I'm KAMO, a developer in Kyoto. I write implementation logs — working code, real costs, what broke.
