Marcus Rowe

Originally published at techsifted.com

How to Make YouTube Shorts with AI in Under 5 Minutes

Five minutes is an aggressive claim. Let me be precise about what it means: after the first run through this workflow, subsequent Shorts take me roughly 5 minutes of active work, not counting generation time that runs in the background. The first time through, including setting up accounts and learning each tool, plan for 45-90 minutes.

That's still pretty fast for a finished, publishable video. Let me show you exactly how it works.

The Full 5-Minute AI Workflow

The workflow has six steps. Every single step has a free option. I'll note the paid upgrades that are worth considering, but you can run this entire pipeline at zero cost.

Step 1: Script (30-60 seconds with AI)
Step 2: Voiceover (ElevenLabs, free tier)
Step 3: Visuals (Pika or Kling for B-roll, or stock)
Step 4: Captions (auto-caption in CapCut)
Step 5: Music (Suno free tier)
Step 6: Assembly (CapCut or Descript)

I'm going to walk through each step and then show you a real example -- an actual Short I built using this exact workflow.

Step 1: Script

Shorts run 15-60 seconds. At speaking pace, that's roughly 40-160 words. Short.
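
If you want to sanity-check that range for a specific length, the arithmetic is simple: 40 words in 15 seconds works out to 160 words per minute. Here's a rough sketch; the `word_budget` helper and the 160 wpm default are my illustrative assumptions, not a measured rate.

```python
def word_budget(seconds: float, words_per_minute: int = 160) -> int:
    """Approximate number of words that fit in a voiceover of this length.

    The 160 wpm default is an assumption derived from the figures above
    (40 words in 15 seconds); adjust it to match your own delivery.
    """
    return round(seconds * words_per_minute / 60)


if __name__ == "__main__":
    for length in (15, 45, 55, 60):
        print(f"{length:>2}s -> about {word_budget(length)} words")
```

If your delivery is slower, lower `words_per_minute` and the budget shrinks accordingly.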

The mistake most people make is writing a full script and then chopping it down. Better approach: write specifically for the format from the start.

Here's the ChatGPT prompt template I use:

"Write a YouTube Shorts script (45-55 seconds when spoken aloud, conversational pace) about [TOPIC]. Format: hook in the first 3 seconds that creates curiosity, 3-4 punchy information points, strong CTA at the end. No intro, no 'Hey guys,' no padding. Make it sound like a knowledgeable person talking, not a script being read."

For a Short about AI tools, I'd fill in: "the 3 most useful free AI tools most people don't know about."
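
If you reuse the template a lot, it helps to keep it in one place and only swap the topic. A minimal sketch of that idea; `build_script_prompt` is a hypothetical helper name, and the output is just text you paste into ChatGPT, not an API call.

```python
PROMPT_TEMPLATE = (
    "Write a YouTube Shorts script (45-55 seconds when spoken aloud, "
    "conversational pace) about {topic}. Format: hook in the first 3 seconds "
    "that creates curiosity, 3-4 punchy information points, strong CTA at the "
    "end. No intro, no 'Hey guys,' no padding. Make it sound like a "
    "knowledgeable person talking, not a script being read."
)


def build_script_prompt(topic: str) -> str:
    """Fill the [TOPIC] slot of the template with a cleaned-up topic string."""
    topic = topic.strip()
    if not topic:
        raise ValueError("topic must not be empty")
    return PROMPT_TEMPLATE.format(topic=topic)


if __name__ == "__main__":
    print(build_script_prompt(
        "the 3 most useful free AI tools most people don't know about"
    ))
```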

The output is usually 90% there. I edit for my voice -- add a specific example, change something that sounds too formal, cut a sentence that's redundant. Takes 30-60 seconds.

One thing worth knowing: YouTube Shorts rewards retention. Every sentence should make the viewer think "wait, I need to hear the next part." Write the script with that in mind -- front-load the value, don't save the good stuff for the end.

Step 2: Voiceover with ElevenLabs

ElevenLabs generates synthetic voiceovers in various voices (or your cloned voice) from text. The free tier gives you 10,000 characters per month -- more than enough for Shorts.

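For a 50-word script (roughly 300 characters, spaces included), the free tier covers about 30 Shorts per month, assuming one generation per script. At one Short a day you won't run out, though regenerating takes eats into the budget.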
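
The quota is counted in characters, not words, so the most reliable check is to measure an actual script. A rough sketch, assuming every character (spaces and punctuation included) counts against the 10,000 and each script is generated once; `shorts_per_month` is a hypothetical helper, not part of ElevenLabs.

```python
def shorts_per_month(script: str, monthly_chars: int = 10_000) -> int:
    """How many voiceovers of this script fit in the monthly character quota.

    Assumes every character of the pasted text counts against the quota,
    spaces and punctuation included, and that each script is generated once
    (regenerating a take would spend the quota again).
    """
    chars = len(script)
    if chars == 0:
        raise ValueError("script must not be empty")
    return monthly_chars // chars


if __name__ == "__main__":
    sample = (
        "Three AI tools I use every week that actually work. "
        "First: ElevenLabs. Second: Kling AI. Third: Descript."
    )
    print(len(sample), "characters ->", shorts_per_month(sample), "per month")
```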

Workflow:

  1. Go to ElevenLabs, create a free account
  2. Navigate to "Speech Synthesis"
  3. Paste your script
  4. Select a voice (try "Adam" or "Rachel" for natural-sounding delivery)
  5. Adjust stability and clarity sliders (start at default, adjust if the output sounds too robotic or too erratic)
  6. Generate and download MP3

The quality on ElevenLabs free tier is genuinely good. For most audiences, you can't distinguish it from a human voiceover at normal playback speed. In the Shorts context -- phone speakers, often with captions on -- it's more than sufficient.

Optional upgrade: ElevenLabs paid plans start at $5/month and include voice cloning (train it on your voice, generate voiceovers that sound like you). For building a consistent presence without having to record every video yourself, the voice clone is compelling. See our full How to Clone Your Voice with AI guide for the complete step-by-step process, and the ElevenLabs Review 2026 for detailed pricing and feature breakdown. Want to compare all major AI voice tools? The Best AI Voice Generators 2026 roundup covers everything.

Step 3: Visuals -- AI-Generated B-roll

This is the step most guides skip over. You have voiceover audio. What do you show on screen?

Option A: Show yourself on camera (traditional talking-head). Valid, no AI needed for visuals.

Option B: Generate AI visuals using Pika or Kling. This is what makes the workflow fully AI-native.

For each sentence or concept in your script, generate a matching 3-5 second video clip:

Using Kling AI (recommended for realism):

  • Go to Kling AI, use your daily free credits (66/day)
  • Prompt: describe the visual you want -- "aerial drone shot over a modern city at night with glowing lights" or "close-up of hands typing on a keyboard, soft blue light" -- short, descriptive, specific
  • Generate at 5 seconds
  • Download

Using Pika (recommended for stylized/animated look):

  • Go to Pika, use free tier
  • Same prompting approach, but add a style modifier: "anime style" or "illustrated, painterly" if you want a non-realistic aesthetic
  • Pika's style presets are faster than manually prompting for specific aesthetics

For a 45-second Short, you need roughly 8-10 clips. At 5-10 minutes of generation time each, this is the longest step -- but it runs in the background while you're doing other things.
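
To plan this step for a different length, here's a back-of-the-envelope sketch. The helper names are hypothetical, the 5-second clip length and 5-10 minutes per clip come from the numbers above, and the totals assume clips render strictly one after another (an assumption; queuing several at once would shorten the wait).

```python
import math


def clips_needed(short_seconds: float, clip_seconds: float = 5) -> int:
    """Number of clips required to cover the Short end to end."""
    return math.ceil(short_seconds / clip_seconds)


def generation_minutes(
    clips: int, per_clip: tuple[int, int] = (5, 10)
) -> tuple[int, int]:
    """Best and worst case total wait if clips render one at a time."""
    low, high = per_clip
    return clips * low, clips * high


if __name__ == "__main__":
    n = clips_needed(45)
    low, high = generation_minutes(n)
    print(f"{n} clips, roughly {low}-{high} minutes of background generation")
```

For a 45-second Short built from 5-second clips, that comes to 9 clips and 45 to 90 minutes of background generation.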

Alternative: Free stock footage. Pexels Video (pexels.com/videos) has free HD and 4K footage. Not AI-generated, but completely free and often faster than generating the exact clip you need. For abstract topics, stock footage is underrated.

See our Pika vs Kling comparison for more detail on which platform fits which use cases.

Step 4: Captions

Non-negotiable in 2026. Shorts without captions lose a significant portion of their audience -- many viewers watch without sound, and even viewers with sound on read along with captions.

CapCut (free, recommended): Import your video, click "Auto Captions," select language, done. The accuracy is high on clean voiceover audio. CapCut also offers style presets for captions -- the animated word-by-word highlight style that's common on viral Shorts.

Descript (free tier, 1 hour/month): Import your voiceover audio, generate transcript, export with burned-in captions. Descript's caption export is cleaner for professional formatting.

For Shorts specifically, I like CapCut's caption styles better -- they're tuned for the format. For longer YouTube videos, Descript is more flexible.

Caption placement: center screen works well for Shorts viewed on phone. Don't put captions too low (they get cut off on some devices) or too high (conflicts with the title area).

Step 5: Music with Suno

Background music changes the feel of a Short more than most creators realize. Silence under a voiceover sounds clinical. The right background music makes the same script feel energetic, emotional, or professional.

Suno generates original music from text prompts and has a free tier (50 credits per day on the current free plan, which works out to about 10 songs -- they've adjusted this a few times, so verify current limits).

My prompting approach:

  • Keep it simple: "upbeat background music, no lyrics, positive energy, corporate"
  • Or: "ambient electronic, minimal, tech-focused, no lyrics"
  • Or: "light acoustic guitar, warm, approachable"

Generate 3-4 options and use the one that fits your specific content. Generation takes about 30 seconds.

The important thing: keep the music volume low. The voiceover should be clearly dominant. Music is texture, not centerpiece. In CapCut, I typically run background music at 15-25% volume under the voiceover.
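
Some editors show track volume in decibels rather than percent. Here's a small conversion sketch, assuming the percentage is a linear amplitude scale. That's my assumption; CapCut's slider may map differently, so treat the result as a starting point and trust your ears.

```python
import math


def percent_to_db(percent: float) -> float:
    """Convert a linear volume percentage to a gain in decibels.

    Assumes 100% means unity gain (0 dB) and the scale is linear in
    amplitude, which may not match every editor's volume slider.
    """
    if percent <= 0:
        raise ValueError("percent must be positive")
    return 20 * math.log10(percent / 100)


if __name__ == "__main__":
    for pct in (15, 20, 25):
        print(f"{pct}% is about {percent_to_db(pct):.1f} dB")
```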

Step 6: Assembly in CapCut

CapCut is free, available on both desktop and mobile, and has become the de facto standard for Shorts and TikTok editing. It's genuinely good for this format.

The assembly process:

  1. Create a new project, select 9:16 ratio (vertical, standard for Shorts)
  2. Import your video clips (AI-generated B-roll or stock)
  3. Import your voiceover audio -- lay it on the audio track
  4. Trim video clips to match voiceover pacing
  5. Add captions (use CapCut's auto-caption feature)
  6. Import background music track, set volume low
  7. Add transitions (subtle: fade or cut. Skip the spinning star transitions)
  8. Export at 1080x1920 (1080p vertical)
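
If you export at a different resolution, it's worth confirming the frame is still exactly 9:16 before uploading. A tiny check you could run on the numbers; `is_shorts_frame` is a hypothetical helper, not part of CapCut.

```python
from fractions import Fraction


def is_shorts_frame(width: int, height: int) -> bool:
    """True when the frame is exactly 9:16 (portrait)."""
    return Fraction(width, height) == Fraction(9, 16)


if __name__ == "__main__":
    for size in ((1080, 1920), (720, 1280), (1920, 1080)):
        print(size, "9:16" if is_shorts_frame(*size) else "not 9:16")
```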

Total assembly time, once you have your assets: 5-10 minutes in CapCut for a 45-second Short.

Real Example: "3 AI Tools Roundup" Short

Let me walk through the actual Short I built using this workflow.

Topic: "3 AI tools that actually save me time every week" -- a quick showcase format that does well on Shorts because it's specific and immediately useful.

Script I used:

"Three AI tools I use every week that actually work. First: ElevenLabs -- I generate voiceovers for videos without recording anything. It's my voice, digitally. Second: Kling AI -- free AI video clips, 66 a day. I use it for B-roll when I don't have footage. Third: Descript -- edit podcasts and YouTube by editing text. Delete a word from the transcript, it cuts the audio. That's it. These three tools cut my production time in half. Links in the description."

Word count: ~85 words. At normal speaking pace, about 40 seconds. Good.

Visuals I generated:

  • ElevenLabs concept: "futuristic waveform audio visualization, blue and purple, dark background" (Kling)
  • Kling concept: "AI video generation interface on a computer screen, dark UI, glowing elements" (Kling)
  • Descript concept: "text document with waveform below, editing interface, light and clean" (Kling)
  • Plus two transition clips: "abstract data visualization, flowing lines" (Pika)

Total generation time: ~25 minutes (ran in the background while I worked on other things)

Assembly: 7 minutes in CapCut. Captions applied automatically. Background music from Suno (ambient electronic, 20% volume).

Result: A 40-second Short that looks professional, cost nothing to produce, and covers a useful topic for the audience I'm building.

Budget Options at Every Step

For completeness, here is the free option and the paid upgrade at each step:

Step      | Free Option                    | Paid Upgrade
Script    | ChatGPT free tier              | ChatGPT Plus ($20/mo)
Voiceover | ElevenLabs free (10k chars/mo) | ElevenLabs Starter ($5/mo, voice clone)
Visuals   | Kling free (66 credits/day)    | Kling paid ($7/mo, premium quality)
Captions  | CapCut free                    | CapCut Pro ($8/mo)
Music     | Suno free                      | Suno Pro ($8/mo)
Assembly  | CapCut free                    | Descript Hobbyist ($24/mo)

You can run the entire workflow at $0. The paid upgrades improve quality or volume limits, but aren't required to start.
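
If you're weighing which upgrades to take, a quick tally helps. This sketch uses the prices exactly as listed in the table above, which may have changed since; the `monthly_cost` helper is hypothetical.

```python
# Monthly prices in USD, copied from the table above (verify before buying).
PAID_UPGRADES = {
    "ChatGPT Plus": 20,
    "ElevenLabs Starter": 5,
    "Kling paid": 7,
    "CapCut Pro": 8,
    "Suno Pro": 8,
    "Descript Hobbyist": 24,
}


def monthly_cost(chosen: list[str]) -> int:
    """Total monthly spend for the selected upgrades."""
    return sum(PAID_UPGRADES[name] for name in chosen)


if __name__ == "__main__":
    print("Voice clone only:", monthly_cost(["ElevenLabs Starter"]))
    print("Everything:", monthly_cost(list(PAID_UPGRADES)))
```

Taking every upgrade in the table adds up to $72 a month, so it's worth adding them one at a time as you hit a specific limit.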

Common Mistakes (and How to Avoid Them)

Mistake 1: Making the hook too slow. The first 2-3 seconds determine whether someone scrolls. Start with the most interesting thing you're going to say -- not a greeting, not an intro, not "today I'm going to tell you about." Jump straight in.

Mistake 2: Generating too many visual styles. If your B-roll mixes realistic Kling clips, anime-style Pika clips, and stock footage, the visual inconsistency is jarring. Pick one aesthetic and stick with it for the whole Short.

Mistake 3: Overloading captions. Captions should feel like they're helping, not competing with the video. Keep fonts clean and the size readable, and avoid distracting animations on every single word.

Mistake 4: Setting music too loud. If I can't understand the voiceover because the music is competing, I'm out. Background music at 15-25% is almost always right.

Mistake 5: Skipping the iteration step. Generate your first Short and post it. Then generate another one next week. The improvement from iteration one to iteration ten is dramatic. Don't optimize endlessly before starting.

Getting Better Results from AI Video

The biggest variable in AI-generated Shorts quality is prompt specificity. "Show a futuristic city" produces generic output. "Aerial drone shot above a glowing neon-lit city at night, fog settling in the streets, cyberpunk aesthetic" gives the AI something to work with.

Specific details that help:

  • Camera angle (aerial, close-up, wide shot, side profile)
  • Lighting (golden hour, neon, studio lighting, backlit)
  • Motion (slow pan, static shot, zoom in)
  • Aesthetic (realistic, anime, illustration, cinematic)
  • Time of day / environment

Good prompt, good output. Vague prompt, vague output.
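
One way to make specificity a habit is to treat the details above as slots you fill in for every clip. A minimal sketch; `build_video_prompt` and its parameter names are hypothetical, and the comma-separated ordering is just a convention I'm assuming, not something Kling or Pika requires.

```python
from typing import Optional


def build_video_prompt(
    subject: str,
    *,
    camera: Optional[str] = None,
    lighting: Optional[str] = None,
    motion: Optional[str] = None,
    aesthetic: Optional[str] = None,
    environment: Optional[str] = None,
) -> str:
    """Join the subject and any supplied details into one prompt string."""
    parts = [subject, camera, lighting, motion, aesthetic, environment]
    # Skip details that were left out or are blank.
    return ", ".join(p.strip() for p in parts if p and p.strip())


if __name__ == "__main__":
    print(build_video_prompt(
        "hands typing on a keyboard",
        camera="close-up",
        lighting="soft blue light",
        motion="static shot",
        aesthetic="realistic",
    ))
```

Leaving a slot empty is fine, but if most of them are empty, that's usually the sign of a vague prompt.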

What You'll Need (Complete Checklist)

  • ChatGPT free account (script)
  • ElevenLabs free account (voiceover)
  • Kling AI free account (video B-roll)
  • Pika free account (stylized B-roll, optional)
  • Suno free account (music)
  • CapCut free (assembly and captions)
  • A YouTube channel (if you don't have one, creating it takes 5 minutes)

Total setup time: 30-40 minutes for accounts. After that, the workflow takes 5-15 minutes per Short of active work.

The Honest Part

This workflow produces good Shorts, not great Shorts. The ceiling is lower than what you'd create with your own on-camera presence, professional audio, and original footage.

What the AI workflow gives you: speed, accessibility, and the ability to produce content consistently without recording yourself every time.

What it doesn't give you: your personality on screen, the kind of parasocial connection that builds loyal audiences, the spontaneous moments that make video feel real.

The best creators I've watched build short-form audiences combine both: an AI-assisted workflow for speed and volume, with real on-camera moments mixed in for connection. The tools do the production work; you bring the perspective and voice.

Start with the AI workflow to build the habit and test what topics your audience responds to. Add more of yourself as you find your footing.

For the detailed comparison of Pika vs Kling specifically, see Pika vs Kling AI: Budget AI Video Generators Compared. For the full landscape of AI video tools at every price point, Best AI Video Generators 2026 has the complete picture. And if you want to go deeper on AI video for marketing content, How to Create Marketing Videos with AI covers the longer-form workflow.
