Alex Shev

Posted on Jun 5 • Originally published at terminalskills.io

Building an AI Short Video Generator: Why the Workflow Needs Skills, Not Just Prompts

#ai #devtools #automation #productivity

Most AI short-form video demos skip the boring part.

They show a finished TikTok, Reel, or YouTube Short. Maybe they show the prompt. Maybe they show the generated script or the final render.

But the hard part is not making one video.

The hard part is making the fifteenth video without the whole system turning into a pile of one-off scripts, half-remembered FFmpeg commands, broken captions, inconsistent hooks, and manual upload steps.

That is where I think the conversation around AI video automation gets more interesting.

Not:

Can an AI generate a Short?

But:

What workflow does an AI agent need to generate Shorts repeatedly?

I was looking at a Terminal Skills use case for building an AI short video generator, and the useful part is not the fantasy of "push one button, print infinite content."

The useful part is the stack.

The real job is a pipeline

A short-form video generator sounds like one tool.

In practice, it is a pipeline:

topic research
  -> script
  -> voiceover
  -> footage or visual generation
  -> subtitles
  -> assembly
  -> platform formatting
  -> upload
  -> analytics

Each step has different failure modes.

Topic research can produce generic ideas.
Scripts can be too long.
Voice can drift from the brand.
Footage can mismatch the narration.
Subtitles can land under platform UI.
FFmpeg can export a technically valid file that a platform still rejects or re-encodes badly.
Uploads can succeed at the API level while the video never actually goes live.

If you try to solve all of that with one giant prompt, the agent has to keep too much operational knowledge in its head.

That is fragile.

The better pattern is to split the workflow into skills.

What a skill gives the agent

A skill is not just a code snippet.

For this kind of workflow, a useful skill tells the agent:

when to use this capability
what inputs are expected
what output should exist afterward
what validation is required
when to stop instead of pretending success

That last point matters.

For media automation, "the command ran" is not enough.

The agent needs to verify things like:

is the video actually 9:16?
is the duration within the target range?
does the file have an audio stream?
are captions inside the safe area?
is the codec platform-friendly?
was the upload verified against the final published page, not just the composer?

This is the difference between an automation demo and an operating workflow.
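Here is a minimal sketch of that idea in Python, assuming a skill can be modeled as a function plus a validator. The names `Skill`, `SkillFailed`, and `execute` are hypothetical and not part of Terminal Skills. The only point is the last branch: failed validation stops the run instead of passing a bad asset downstream.

```python
from dataclasses import dataclass
from typing import Callable


class SkillFailed(Exception):
    """Raised when a skill's output does not pass its own validation."""


@dataclass
class Skill:
    # All field names here are illustrative assumptions.
    name: str
    when_to_use: str
    run: Callable[[dict], dict]             # does the work, returns the outputs
    validate: Callable[[dict], list[str]]   # returns problems; empty list means ok

    def execute(self, inputs: dict) -> dict:
        outputs = self.run(inputs)
        problems = self.validate(outputs)
        if problems:
            # Stop here instead of pretending success.
            raise SkillFailed(f"{self.name}: " + "; ".join(problems))
        return outputs
```

An agent calling `execute` either gets validated outputs or an explicit failure it has to report.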

A practical skills stack for short-form video

The Terminal Skills use case frames the AI short video generator as a stack, not a monolith.

I would break it down like this.

1. Research skill

This skill should not just "find trending topics."

It should produce usable candidates:

topic
why it is timely
target audience
hook angle
risk level
source links

For a YouTube Shorts pipeline, the research skill should bias toward ideas that can be explained visually in under 60 seconds.

Not every good article becomes a good Short.

2. Script skill

Short-form scripts need constraints.

A useful script skill should enforce:

one idea per video
hook in the first 1-2 seconds
short sentences
a clear visual beat for each section
no long intro
no vague CTA unless the channel actually uses CTAs

The output should be structured, not just prose:

{
  "hook": "This one missed call can cost a local business hundreds.",
  "beats": [
    { "time": "0-5s", "line": "Most small businesses do not lose leads in ads.", "visual": "phone ringing unanswered" },
    { "time": "5-15s", "line": "They lose them after the click.", "visual": "call log with missed calls" }
  ],
  "cta": "Follow for more local business automation ideas."
}

Now the renderer has something it can work with.
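Structured output also means the script can be checked before anything is rendered. The sketch below assumes the JSON shape shown above, with `time` ranges written like `"5-15s"`. The function names and the 60-second default are my own assumptions, not a fixed spec.

```python
import re


def parse_range(time: str) -> tuple[int, int]:
    """Turn a range such as '5-15s' into (5, 15)."""
    match = re.fullmatch(r"(\d+)-(\d+)s", time)
    if not match:
        raise ValueError(f"bad time range: {time!r}")
    return int(match.group(1)), int(match.group(2))


def script_problems(script: dict, max_seconds: int = 60) -> list[str]:
    """Return a list of human-readable problems; an empty list means the script passes."""
    problems = []
    if not script.get("hook"):
        problems.append("missing hook")
    beats = script.get("beats", [])
    if not beats:
        problems.append("no beats")
    cursor = 0  # where the next beat is expected to start
    for i, beat in enumerate(beats):
        start, end = parse_range(beat["time"])
        if start != cursor:
            problems.append(f"beat {i} starts at {start}s, expected {cursor}s")
        if end <= start:
            problems.append(f"beat {i} has no duration")
        if not beat.get("visual"):
            problems.append(f"beat {i} has no visual")
        cursor = end
    if cursor > max_seconds:
        problems.append(f"script runs {cursor}s, limit is {max_seconds}s")
    return problems
```

The example script above passes this check: its beats are contiguous, each has a visual, and it ends at 15 seconds.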

3. Voice skill

Text-to-speech is easy to call.

Brand-consistent voice is harder.

A voice skill should know:

preferred provider
voice ID or style
pacing
loudness target
whether to use pauses
file naming conventions
retry rules

It should also validate that the audio duration roughly matches the script timing before video assembly starts.
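A rough version of that duration check, assuming the voiceover is a WAV file and the scripted length is already known in seconds. The 15% tolerance is an arbitrary assumption. Tune it per channel.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Length of a WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_matches(audio_seconds: float, script_seconds: float,
                     tolerance: float = 0.15) -> bool:
    """True when the audio is within +/- tolerance (a fraction) of the scripted length."""
    return abs(audio_seconds - script_seconds) <= tolerance * script_seconds
```

If `duration_matches` returns False, the voice skill should retry or flag the clip rather than let assembly stretch the visuals to fit.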

4. Caption skill

Captions are not decoration for Shorts.

They are part of the format.

A caption skill should own:

line length
word grouping
font size
contrast
bottom safe zone
whether to use word-level highlighting
SRT or burned-in output

This is where a lot of AI video pipelines become visibly cheap.

The content might be fine, but the captions are too low, too wide, too fast, or hidden under the TikTok/Shorts interface.
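Two of those rules are easy to sketch: word grouping by line length, and a safe-zone check on the caption box. The margin values below are placeholder assumptions for a 1080x1920 frame, not published platform numbers. Measure them against the real UI overlays before relying on them.

```python
def group_words(words: list[str], max_chars: int = 24) -> list[str]:
    """Greedily pack words into caption lines no longer than max_chars."""
    lines, current = [], ""
    for word in words:
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chars:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def inside_safe_zone(top_px: int, height_px: int, frame_height: int = 1920,
                     top_margin_px: int = 200, bottom_margin_px: int = 400) -> bool:
    """True when the caption box stays clear of the assumed top and bottom UI margins."""
    bottom_px = top_px + height_px
    return top_px >= top_margin_px and bottom_px <= frame_height - bottom_margin_px
```

For example, `group_words("They lose them after the click.".split(), max_chars=16)` splits the line into `"They lose them"` and `"after the click."`.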

5. FFmpeg or assembly skill

This is the mechanical layer.

It should assemble the assets into a predictable, platform-ready output:

1080x1920
H.264
AAC
yuv420p
faststart (moov atom at the start of the file)
30-60 seconds
safe captions
consistent naming

The important part is not memorizing the FFmpeg flags.

The important part is that the agent knows the output contract.

For example:

ffprobe -v error -show_streams -show_format -of json output/short.mp4

That check should happen after render, not after a human complains that the upload failed.
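One way to turn that ffprobe call into a contract check is sketched below. It assumes `ffprobe` is on the PATH and reads the standard fields from its JSON output. The function names and the 30-60 second bounds are assumptions taken from the list above. It does not check faststart, which needs a separate look at the file layout.

```python
import json
import subprocess


def probe(path: str) -> dict:
    """Run the ffprobe command shown above and parse its JSON output."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-show_format", "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def contract_problems(info: dict, min_s: float = 30.0, max_s: float = 60.0) -> list[str]:
    """Compare parsed ffprobe output against the output contract; empty list means pass."""
    problems = []
    streams = info.get("streams", [])
    video = next((s for s in streams if s.get("codec_type") == "video"), None)
    audio = next((s for s in streams if s.get("codec_type") == "audio"), None)
    if video is None:
        problems.append("no video stream")
    else:
        size = (video.get("width"), video.get("height"))
        if size != (1080, 1920):
            problems.append(f"video is {size[0]}x{size[1]}, expected 1080x1920")
        if video.get("codec_name") != "h264":
            problems.append(f"video codec is {video.get('codec_name')}, expected h264")
        if video.get("pix_fmt") != "yuv420p":
            problems.append(f"pixel format is {video.get('pix_fmt')}, expected yuv420p")
    if audio is None:
        problems.append("no audio stream")
    elif audio.get("codec_name") != "aac":
        problems.append(f"audio codec is {audio.get('codec_name')}, expected aac")
    # ffprobe reports the container duration as a string of seconds.
    duration = float(info.get("format", {}).get("duration", 0))
    if not min_s <= duration <= max_s:
        problems.append(f"duration is {duration:.1f}s, expected {min_s:.0f}-{max_s:.0f}s")
    return problems
```

Calling `contract_problems(probe("output/short.mp4"))` right after render gives the agent a concrete list to act on.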

6. Upload skill

Upload automation is where I would be most conservative.

It is one thing to render a local MP4.
It is another thing to publish externally.

The upload skill should separate:

prepare upload
verify metadata
draft/schedule
publish
confirm public URL

Those should not all be one invisible step.

If a human approval gate is required, the skill should say so plainly.
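A small sketch of keeping those steps visible, assuming the five stages above run in order and that publishing is the step that needs approval. `Stage`, `ApprovalRequired`, and `advance` are hypothetical names. No real upload API is called here.

```python
from enum import Enum


class Stage(Enum):
    PREPARED = 1
    METADATA_VERIFIED = 2
    SCHEDULED = 3        # draft or scheduled, not yet public
    PUBLISHED = 4
    URL_CONFIRMED = 5


class ApprovalRequired(Exception):
    """Raised when the next stage needs an explicit human decision."""


def advance(current: Stage, human_approved: bool = False) -> Stage:
    """Move one stage forward; refuse to publish without approval."""
    if current is Stage.URL_CONFIRMED:
        raise ValueError("already at the final stage")
    next_stage = Stage(current.value + 1)
    if next_stage is Stage.PUBLISHED and not human_approved:
        raise ApprovalRequired("publishing needs human approval")
    return next_stage
```

Because each transition is its own call, a log of stages shows exactly where an upload stopped.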

The useful mental model

The mistake is thinking of this as:

prompt -> video

The better model is:

brief -> structured assets -> render -> verify -> publish decision

That model is slower to explain, but much more reliable in production.

It also gives the agent smaller jobs.

The research skill does not need to understand FFmpeg.
The caption skill does not need to know the YouTube Data API.
The upload skill does not need to invent the script.

Each skill owns a boundary.

That boundary is what makes the workflow debuggable.

What I would automate first

If I were building this from scratch, I would not start with full auto-publishing.

I would start with a local generator that produces a review folder:

shorts/
  001/
    script.json
    voiceover.wav
    captions.srt
    final.mp4
    checks.json
    publish-notes.md

Then the agent reports:

Generated 12 Shorts.
10 passed validation.
2 need review:
- #004 captions exceed safe zone
- #009 audio duration is longer than target

That is already valuable.

It removes the repetitive production work while keeping a human in control of the final publishing decision.

Only after that is reliable would I add scheduling or upload automation.
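The review report can be generated straight from the validation results. This sketch assumes each Short's problems have already been collected into a dict keyed by folder name. How `checks.json` is laid out on disk is left open.

```python
def summarize(checks: dict[str, list[str]]) -> str:
    """Build the review report from {short_id: [problems]}; an empty list means passed."""
    failed = {short_id: problems for short_id, problems in checks.items() if problems}
    lines = [
        f"Generated {len(checks)} Shorts.",
        f"{len(checks) - len(failed)} passed validation.",
    ]
    if failed:
        lines.append(f"{len(failed)} need review:")
        for short_id, problems in sorted(failed.items()):
            lines.append(f"- #{short_id} " + "; ".join(problems))
    return "\n".join(lines)
```

The report stays honest because it is derived from the checks, not written by the agent from memory.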

The bigger point

AI video automation is not just a model problem.

It is a workflow problem.

The teams that win here will not be the ones with the longest prompt.

They will be the ones that turn each fragile part of the process into a small, documented, reusable skill:

research
scripting
voice
captions
rendering
validation
upload
analytics

That is how you move from "I made one cool video" to "I can produce a repeatable content pipeline without babysitting every export."

And that is the part I care about most.

The demo is the video.

The product is the workflow.

Source use case: Build an AI Short Video Generator

Top comments (2)

danio • Jun 7

The "fifteenth video" framing nails it — one Short is a demo, the fifteenth is a systems problem. I run a daily AI-news Short pipeline myself, and the failure mode you flagged, "subtitles can land under platform UI," bit me for weeks until I reserved a fixed caption-safe zone instead of trusting the renderer. Treating each stage as its own contract is the part most demos skip.

Alex Shev • Jul 14

That caption-safe-zone example is exactly the kind of thing that separates a demo from a pipeline.

The first video can survive manual fixes. The fifteenth needs layout rules, platform constraints, render checks, and a contract for where text is allowed to live. Otherwise every export becomes a small guessing game against the UI chrome.

I like treating those constraints as part of the skill, not as post-production trivia. They are what make the workflow repeatable.