The Problem Nobody Talks About
If you've used any AI video generator in the last year — Sora, Veo, Kling, Runway, Luma, Seedance, Wan, pick your poison — you've probably run into the same wall:
You can make one beautiful 5-second clip. But you can't make a 30-second video that doesn't look like garbage.
The individual shots are stunning. The cinematography is often better than amateur footage shot on a phone. The lighting is usually dreamlike. And then you try to make an actual TikTok or ad or explainer, and you end up with this:
- Shot 1: warm golden hour, shallow DOF, gorgeous
- Shot 2: suddenly clinical daylight, deep focus, different lens entirely
- Shot 3: back to cinematic, but a completely different color palette
- Shot 4: looks like it was shot on a different planet
- Final video: jarring cuts that scream "this was made by six different cameras on six different days"
The tools aren't the problem. The tools can produce world-class shots. The problem is that you're treating each generation as an independent creative decision instead of part of a coordinated shoot.
I spent way too long fighting this and eventually built a Claude Code skill to fix it. This post is about what I learned — the conceptual insight, the technical approach, and a walkthrough you can steal.
The Core Insight: Visual Consistency > Shot Quality
Here's the counterintuitive thing: a mediocre but consistent set of shots edits together. Six gorgeous but mismatched shots do not.
If you've ever worked with a real cinematographer, this is obvious. Before the camera rolls on day one, they lock in:
- Color palette
- Key lighting direction and temperature
- Lens choice (prime vs zoom, focal length, max aperture)
- Film stock (or digital LUT) for the "look"
- Camera movement grammar (handheld vs locked vs crane)
Every shot in the project respects that baseline. That's how 100 clips cut together feel like one intentional movie. It's not about any single shot being perfect. It's about every shot belonging to the same world.
AI video generators give you zero enforcement of this. Every prompt is a clean slate. If you don't encode the baseline into every prompt, you get six different movies in a 30-second timeline.
Why Single-Prompt Optimizers Don't Help
The AI space is already crowded with "prompt optimizer" tools for video. They take your vague idea and spit out a more detailed single prompt. Useful if you're making one clip. Useless if you're making six.
A prompt optimizer can turn "horse running" into:
> A majestic horse galloping across a sun-drenched meadow, golden hour backlight with warm rim light, shallow depth of field, slow motion, cinematic 1080p, 16:9
That's a fine prompt. But if your video needs 6 shots and you optimize each one independently, you'll get 6 different "optimal" prompts with 6 different visual languages. Each one is individually optimized and collectively broken.
What's missing isn't better per-shot prompting. It's a layer above the prompts — a shared visual grammar that every shot has to respect.
The Approach: A Shot-List Storyboard
The skill I built does exactly one thing: it turns a brief into a coordinated shot-list storyboard.
Here's the workflow:
Step 1 — Brief Intake
The skill asks for five things:
- Platform and duration (TikTok 30s, Reel 15-60s, YouTube Short, ad, explainer)
- What the video is about
- Brand vibe (cozy, energetic, premium, minimalist, playful, cinematic)
- Call to action
- Hard constraints (logo, colors, locations)
Five questions, one message. No multi-turn interrogation.
Step 2 — Infer Structure
Duration → shot count, based on platform conventions:
| Platform | Duration | Shots | Pacing |
|---|---|---|---|
| TikTok Hook | 15s | 3 | Fast cuts, single idea |
| TikTok Reel | 30s | 6 | Hook → Build → Payoff → CTA |
| Instagram Ad | 30s | 6 | Hook → Problem → Product → Benefit → Proof → CTA |
| YouTube Short | 60s | 12 | Hook → 3-act structure → CTA |
| Product Explainer | 90s | 18 | Problem → Solution → How it works → Results → CTA |
Five seconds per shot is the sweet spot: long enough to land an idea, short enough to fit typical scroll-dwell behavior on TikTok and Reels.
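The table above can be sketched as a simple lookup. This is a hypothetical illustration in Python — the skill itself is a markdown prompt, not code, and the names here are mine:

```python
# Sketch of Step 2's duration-to-shots inference, mirroring the table.
# PLATFORM_STRUCTURE and infer_structure are illustrative names.

SHOT_LENGTH_S = 5  # five seconds per shot, per the post

PLATFORM_STRUCTURE = {
    "tiktok_hook":       {"duration_s": 15, "pacing": "Fast cuts, single idea"},
    "tiktok_reel":       {"duration_s": 30, "pacing": "Hook -> Build -> Payoff -> CTA"},
    "instagram_ad":      {"duration_s": 30, "pacing": "Hook -> Problem -> Product -> Benefit -> Proof -> CTA"},
    "youtube_short":     {"duration_s": 60, "pacing": "Hook -> 3-act structure -> CTA"},
    "product_explainer": {"duration_s": 90, "pacing": "Problem -> Solution -> How it works -> Results -> CTA"},
}

def infer_structure(platform: str) -> dict:
    """Map a platform preset to a shot count and pacing pattern."""
    spec = PLATFORM_STRUCTURE[platform]
    return {"shots": spec["duration_s"] // SHOT_LENGTH_S, "pacing": spec["pacing"]}
```

The point of the division by `SHOT_LENGTH_S` is that shot count is derived, never chosen ad hoc: pick the platform, and the structure follows.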
Step 3 — The Visual Theme Layer (The Important Part)
Before writing a single shot, the skill locks in a shared visual language. This is the layer that makes everything work:
```
## Visual Theme (applied to every shot)

- Color palette: Deep espresso brown #3B2416, cream #F5E6D3,
  muted amber #D4A574, soft sage green #8FA88C
- Lighting: Warm golden backlight with motivated window light,
  soft shadows, no harsh fluorescents
- Lens: Shallow DOF, gentle bokeh, 35mm full-frame look
- Film: Subtle 16mm grain, slightly muted saturation,
  warm 3200K color temperature
- Motion: Locked-off or very slow push-ins, no handheld shake
```
Every subsequent prompt must reference these values. Not "warm lighting" in the abstract — "warm golden backlight with motivated window light at 3200K color temperature, shallow DOF, 35mm full-frame look, subtle 16mm film grain." The consistency is enforced through repetition.
This feels verbose when you look at a single prompt, but it's exactly how you get six independently generated clips to look like they came from the same shoot.
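One way to think about this layer: the theme is a single data structure, and every shot prompt draws from it rather than restating the look by hand. A minimal Python sketch, with my own names rather than anything from the skill:

```python
# Hypothetical single source of truth for the shared visual language.
# Values are taken from the coffee-shop example in this post.

VISUAL_THEME = {
    "palette": "deep espresso brown #3B2416, cream #F5E6D3, muted amber #D4A574, soft sage green #8FA88C",
    "lighting": "warm golden backlight with motivated window light, soft shadows, no harsh fluorescents",
    "lens": "shallow depth of field, gentle bokeh, 35mm full-frame look",
    "film": "subtle 16mm grain, slightly muted saturation, warm 3200K color temperature",
    "motion": "locked-off or very slow push-ins, no handheld shake",
}

def theme_suffix(theme: dict) -> str:
    """Render the shared visual language as one comma-joined prompt fragment,
    ready to append to every shot prompt."""
    return ", ".join(theme.values())
```

Because every shot appends the same `theme_suffix`, changing the look of the whole video means editing one dict, not six prompts.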
Step 4 — Write Each Shot
Every shot gets this structure:
```
## Shot N (START-ENDs) — [Purpose: Hook/Setting/Human/Detail/CTA]

Composition: [shot type + angle, e.g., "Extreme close-up, overhead"]
Camera move: [locked/slow dolly in/tracking/crane up]
Lighting: [from Visual Theme, applied to this scene]
Subject: [what is in frame]
Action: [what is happening]
Prompt to copy:
> [40-80 word cinematic prompt including all visual theme values,
> ending with "cinematic 1080p, synchronized audio, Ns, [aspect]"]
Audio direction: [ambient/music beat/voice-over line]
```
The critical rules:
- Every prompt repeats the shared visual language — palette, lighting, lens, film look
- Be concrete — "a woman" → "a barista in her late 20s with wavy auburn hair, denim apron"
- Use cinematography vocabulary — ECU, CU, MS, WS, OTS, dolly, crane, tracking, rack focus
- Always end with technical spec — duration, aspect ratio, "cinematic 1080p, synchronized audio"
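The rules above compose mechanically: scene specifics first, then the shared theme, then the fixed technical spec. A hypothetical Python sketch — the field names are illustrative, not the skill's actual schema:

```python
# Hypothetical prompt assembler for Step 4. The theme string and
# shot fields are examples, not output from the real skill.

THEME = (
    "warm golden backlight with motivated window light, shallow depth of field, "
    "35mm full-frame look, subtle 16mm film grain, warm 3200K color temperature"
)

def build_shot_prompt(shot: dict, theme: str = THEME,
                      duration_s: int = 5, aspect: str = "9:16 vertical") -> str:
    """Assemble one shot prompt: scene specifics, then the shared
    visual theme, then the fixed technical spec."""
    scene = ", ".join([shot["composition"], shot["camera_move"],
                       shot["subject"], shot["action"]])
    spec = f"cinematic 1080p, synchronized audio, {duration_s} seconds, {aspect}"
    return f"{scene}, {theme}, {spec}"

hook = build_shot_prompt({
    "composition": "extreme close-up, overhead",
    "camera_move": "locked-off",
    "subject": "hot water pouring into a ceramic V60 dripper",
    "action": "coffee grounds blooming in slow motion",
})
```

Every prompt produced this way ends with the same technical spec and carries the same theme block, which is exactly the repetition the critical rules demand.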
Step 5 — Add a Narrative Arc
The sequence isn't random. Every video needs a story structure. The skill uses three patterns depending on video type:
TikTok Reel default: Hook → Build → Payoff → CTA
Ad default: Problem → Solution → Proof → CTA
Brand story: Atmosphere → Climax → Logo reveal
The shot order comes from the narrative pattern, not from whatever pops into your head.
Step 6 — Post-Production Checklist
Because the shots are generated independently, post-production stitching is where it all comes together. The skill always outputs:
- [ ] Stitch in [CapCut/Descript/DaVinci] — platform-appropriate tool
- [ ] Apply LUT for color consistency — specific LUT suggestion
- [ ] Add transitions — types and durations per cut
- [ ] Layer BGM — genre, BPM, mood
- [ ] Text overlays — hook copy and CTA
- [ ] Export — platform-specific specs (9:16 1080×1920 30fps for TikTok, etc.)
Walkthrough: 30-Second TikTok for a Coffee Shop Opening
Let me show you what this looks like end-to-end. The brief:
"30-second TikTok Reel for a specialty coffee shop opening next Saturday. Warm, analog, hand-crafted vibe. CTA: 'Opening Saturday.'"
Visual Theme:
- Palette: espresso brown, cream, muted amber, sage green
- Lighting: warm golden backlight, motivated window light, 3200K
- Lens: shallow DOF, 35mm full-frame
- Film: subtle 16mm grain
- Motion: locked-off or very slow push-ins
Shot 1 (0-5s) — Hook: The Pour (ECU, overhead)
> Extreme close-up overhead shot of hot water pouring from a brass gooseneck kettle into a white ceramic V60 dripper filled with dark coffee grounds, the grounds blooming and rising in slow motion, warm golden backlight with visible steam curling upward, shallow depth of field, 35mm full-frame look, subtle 16mm film grain, deep espresso brown and cream color palette, muted saturation, cinematic 1080p, synchronized audio, 5 seconds, 9:16 vertical
Notice how much of the prompt is visual theme values repeated. That's intentional.
Shot 2 (5-10s) — Setting: The Space (MWS, slow dolly in)
> Medium wide shot slow dolly forward into a cozy specialty coffee shop interior, warm morning sunlight streaming through large windows on the left, reclaimed dark wood counter in sharp focus with shelves of handmade ceramic mugs blurred in background, dust motes visible in sun rays, muted sage green wall accents, shallow depth of field, 35mm full-frame look, subtle film grain, warm 3200K color temperature, cinematic 1080p, synchronized audio, 5 seconds, 9:16 vertical
Same palette. Same lens. Same film grain. Different subject, same world.
Shots 3-6 follow the same pattern. The full 6-shot example is in the skill's examples folder if you want to see the whole thing.
Why it works: Shot 1 is sensory (hot water, steam) — the hook. Shot 2 establishes the space. Shot 3 adds a human face (the barista) — the emotional center. Shot 4 is a tactile detail (coffee beans) — signals craft. Shot 5 is aspirational (a customer enjoying the moment) — gives viewers a reason to care. Shot 6 is the CTA reveal. Every shot has a purpose in the emotional arc, and every shot shares the same visual world.
Why Repeating the Visual Language in Every Prompt Matters
This is the part that feels wasteful but is actually the point.
When you generate shot 1 and shot 2 as independent prompts, each generation resets the model's "state." There's no memory of "the last shot was warm 3200K with shallow DOF." If you don't explicitly repeat it, the model will pick its own lighting and lens for shot 2, and you'll get visual whiplash.
The repetition isn't for the model's benefit. It's for your benefit — because it forces you to commit to a visual language before you start generating.
Once you've written "warm golden backlight, 3200K, shallow DOF, 35mm full-frame look, 16mm film grain" six times, you can't half-ass any shot. Every generation is anchored to the same ground truth. That's where consistency comes from.
Where the Skill Fits
I packaged this as a Claude Code skill so you can invoke it like:
```
Use the ai-video-storyboard skill to plan a 30s TikTok
for my specialty coffee shop opening. Warm analog vibe.
```
And get the full 6-shot storyboard back in one response. The skill file itself is a single markdown file with frontmatter — dead simple. It also works as:
- A `.cursorrules` file for Cursor
- A `.windsurfrules` file for Windsurf
- Custom instructions for ChatGPT / Claude.ai / any LLM
The whole thing is MIT licensed. Free, no account, no signup. If you have more example briefs you want to see as storyboards, open an issue on the repo.
Post-Script: The Broader Pattern
The "shared grammar across many independent generations" problem isn't unique to video. It shows up everywhere in AI content creation:
- Image generation — every image in a brand style guide needs the same visual language
- Voice cloning — a multi-segment narration needs consistent pacing and emotional tone
- Code generation — a feature split across many files needs consistent naming, style, patterns
The solution pattern is the same: a constraint layer above the individual generation that every call has to respect. For video, that's the Visual Theme block. For brand images, it's a style guide. For code, it's project conventions or a linting config.
The stuff AI tools are bad at is rarely the individual generation. It's the coordination across generations. If you find yourself making the same creative decision 10 times and getting slightly different answers each time, you need a constraint layer.
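This pattern sketches naturally as a wrapper: take any generation function and compose it with a fixed constraint block. A hypothetical Python sketch — `generate` here stands in for any model call, not a real API:

```python
from typing import Callable

def with_constraints(generate: Callable[[str], str],
                     constraints: str) -> Callable[[str], str]:
    """Wrap a generation call so the shared constraint block is
    appended to every prompt before it reaches the model."""
    def constrained(prompt: str) -> str:
        return generate(f"{prompt}. {constraints}")
    return constrained

# Stand-in "model" that just echoes its prompt, for illustration only.
echo = lambda prompt: prompt
shoot = with_constraints(echo, "warm 3200K backlight, shallow DOF, 16mm grain")
```

The caller writes only the shot-specific part; the shared grammar rides along on every call, whether the output is a video prompt, a brand image prompt, or a code-style preamble.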
Try It
- Skill repo: https://github.com/aicontentskills/ai-video-storyboard-skill
- Full worked example: 30s Coffee Shop TikTok
If you just want to generate a single clip, you can try the happy horse model, currently ranked #1 on Artificial Analysis, which delivers expressive motion, precise lip sync, and cinematic 1080p output in seconds.
What problems are you hitting when you try to make multi-shot AI videos? I'd love to hear in the comments — especially if you've found a different way to enforce visual consistency.