Why Short-Form Video + AI Is the Future
In 2025, short-form video is not just entertainment; it’s the dominant communication medium.
From YouTube Shorts to TikTok to Instagram Reels, billions of daily views flow through highly engaging, bite-sized content.
But behind the scenes, creating even a single 30-second professional-quality video requires:
- Storyboarding (what do we say?)
- Scriptwriting (how do we say it?)
- Narration/voiceover (recording, syncing)
- Video sourcing or shooting
- Editing + captioning
- Music layering + final rendering
That’s hours of manual work. Now imagine doing this at the scale modern creators and startups require: dozens of videos per week.
Enter AI-powered video pipelines. By combining generative AI (Gemini, Mistral), open-source models (WhisperX), and developer tools (MoviePy, Colab, APIs), we can fully automate the workflow: from idea → to script → to captions → to final video.
This isn’t just a productivity hack. It’s the blueprint for AI-native media factories—a future where anyone can generate branded, engaging, and personalized shorts at scale.
What Is the AI Shorts Generator?
The AI Shorts Generator is a Google Colab-based pipeline that:
- Finds relevant stock clips via the Pexels API.
- Uses Gemini 1.5 Flash to caption and describe the scene.
- Writes matching narration scripts using Mistral 7B or Gemini.
- Converts text into realistic voiceovers via Edge-TTS, gTTS, or pyttsx3.
- Adds background music for mood/energy.
- Runs WhisperX alignment to sync words → captions → voiceover.
- Outputs a karaoke-style video with professional polish.
All of this happens inside Colab—no After Effects, no Premiere, no manual syncing.
Technical Architecture
🔑 Secure API Key Input
Securely collect user credentials for:
- **OpenRouter** for the Mistral LLM
- **Google AI Studio** for Gemini
- **Pexels** for video search
```python
from getpass import getpass

openrouter_api_key = getpass("🔐 Enter your OpenRouter API key: ")
google_ai_studio_api_key = getpass("🔐 Enter your Google AI Studio API key: ")
pexels_api_key = getpass("🔐 Enter your Pexels API key: ")
```
1. Data Ingestion: Stock Video Retrieval
• API Used: Pexels API
• Query strings like "motivation", "nature", "city hustle" return thematic clips.
• Clips are filtered by resolution, duration, and orientation.
```python
videos = search_pexels_videos("motivation", per_page=5)
best = videos[0]
video_file = download_video(best["url"], prefix="pexels_nature")
```
Why it matters: You avoid copyright headaches, and video sourcing is fully automated.
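For reference, here is a minimal sketch of what `search_pexels_videos` might look like, reusing the `pexels_api_key` collected earlier. The endpoint and response fields follow the Pexels API docs; the portrait-orientation and duration filters are illustrative assumptions, not necessarily the notebook’s exact logic:
```python
import requests

def search_pexels_videos(query, per_page=5, min_duration=10):
    """Search Pexels for videos and return the highest-resolution file per clip."""
    resp = requests.get(
        "https://api.pexels.com/videos/search",
        headers={"Authorization": pexels_api_key},
        params={"query": query, "per_page": per_page, "orientation": "portrait"},
    )
    resp.raise_for_status()
    results = []
    for video in resp.json()["videos"]:
        if video["duration"] < min_duration:
            continue  # skip clips too short to carry a 30-second short
        # pick the highest-resolution rendition of this clip
        best_file = max(video["video_files"], key=lambda f: f["width"] or 0)
        results.append({"url": best_file["link"], "duration": video["duration"]})
    return results
```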
2. Scene Captioning with Gemini
• Model: Gemini 1.5 Flash (Google Generative AI)
• Input: Middle frame of the video (extract_preview_frame).
• Output: Rich textual description (e.g., “A sunrise over misty mountains, golden light cascading on clouds”).
```python
img = extract_preview_frame(video_file)
sample_image = Image.open(img)
encoded_image = file_to_base64(img)

response = gemini.generate_content([
    {"mime_type": "image/jpeg", "data": encoded_image},
    "Describe this scene in rich detail."
])
caption = response.text
```
Why it matters: Enables vision-to-text, bridging raw video frames to natural language.
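The `gemini` object above is presumably a client from the `google-generativeai` SDK; a minimal setup would look like:
```python
import google.generativeai as genai

genai.configure(api_key=google_ai_studio_api_key)
gemini = genai.GenerativeModel("gemini-1.5-flash")
```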
3. Narration Script Generation
• Option A: Gemini generates script matching clip mood.
• Option B: Mistral 7B via OpenRouter provides lightweight, creative scripting.
We select a TTS voice and generate narration based on the caption and duration:
```python
all_voice_options = await get_all_tts_voices()
selection = prompt_voice_selection_with_json_gemini(caption, duration, all_voice_options)
parsed = parse_voice_selection(selection)
```
Why it matters: Narration isn’t just “describing.” It’s shaping emotional resonance (inspiration, calm, excitement).
Generate the script using Gemini or Mistral:
```python
narration = generate_narration_from_visual(caption, duration)
```
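The notebook doesn’t show this helper’s internals, but a plausible sketch looks like the following. The prompt wording and the roughly 2.5-words-per-second pacing heuristic are assumptions for illustration:
```python
def generate_narration_from_visual(caption: str, duration: float) -> str:
    """Ask Gemini for a voiceover script sized to the clip's duration."""
    target_words = int(duration * 2.5)  # rough TTS speaking rate
    prompt = (
        f"Write a voiceover script of about {target_words} words "
        f"for a short video showing: {caption}. "
        "Match the mood of the scene. No stage directions or emojis."
    )
    return gemini.generate_content(prompt).text.strip()
```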
4. Voice Synthesis (TTS Engines)
• Edge-TTS → Natural voices (best quality).
• *gTTS* → Quick online solution.
• *pyttsx3* → Offline fallback.
Convert the narration into speech with the chosen engine:
```python
output_voice_path = await generate_voice_dynamic(narration, duration, parsed)
```
Why it matters: Multiple backends = reliability + flexibility.
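The fallback chain can be wired up roughly as follows; the engine order mirrors the quality ranking above, and the default voice name is just an example (`generate_voice_dynamic` in the notebook likely layers voice selection and pacing on top):
```python
import edge_tts
import pyttsx3
from gtts import gTTS

async def synthesize(text, voice="en-US-GuyNeural", out_path="voice.mp3"):
    """Try Edge-TTS first, then gTTS, then offline pyttsx3."""
    try:
        await edge_tts.Communicate(text, voice).save(out_path)
    except Exception:
        try:
            gTTS(text=text, lang="en").save(out_path)  # online fallback
        except Exception:
            engine = pyttsx3.init()  # offline, lowest quality
            engine.save_to_file(text, out_path)
            engine.runAndWait()
    return out_path
```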
5. Background Music Integration
• Royalty-free tracks (e.g., Kevin MacLeod’s library).
• Auto-volume balancing via MoviePy.
```python
from IPython.display import Audio  # preview the track inside Colab

music_path = "/content/And Awaken - Stings - Kevin MacLeod.mp3"
Audio(music_path)
```
Compose final video:
```python
final_path = generate_final_video_with_audio(video_file, music_path, output_voice_path)
play_video(final_path)
```
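Internally, `generate_final_video_with_audio` presumably mixes the tracks with MoviePy. A minimal sketch (MoviePy 1.x API; the 0.15 music volume is an illustrative ducking level):
```python
from moviepy.editor import AudioFileClip, CompositeAudioClip, VideoFileClip

def mix_audio(video_path, music_path, voice_path, out_path="final.mp4"):
    """Lay narration over ducked background music and render the result."""
    clip = VideoFileClip(video_path)
    voice = AudioFileClip(voice_path)
    music = AudioFileClip(music_path).volumex(0.15)  # duck music under narration
    music = music.subclip(0, min(music.duration, clip.duration))
    clip = clip.set_audio(CompositeAudioClip([music, voice]))
    clip.write_videofile(out_path, codec="libx264", audio_codec="aac")
    return out_path
```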
6. Word-Level Alignment with WhisperX
WhisperX **refines timing** → ensures every spoken word syncs with captions.
```python
import whisperx

audio = whisperx.load_audio(output_voice_path)
# int8 avoids float16 compute errors on CPU-only Colab runtimes
model = whisperx.load_model("medium", device="cpu", compute_type="int8")
result = model.transcribe(audio)
```
WhisperX returns segments and timings.
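Each segment is a dict with start/end timestamps (in seconds) and the recognized text, which you can inspect directly:
```python
for seg in result["segments"]:
    print(f'{seg["start"]:.2f}s → {seg["end"]:.2f}s  {seg["text"]}')
```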
Why it matters: Karaoke-style captions = higher retention, accessibility, and “pro” feel.
7. Rendering Karaoke Captions
• Fonts loaded dynamically.
• Highlight style applied with **PIL + MoviePy overlays** (sketched below).
• Final export to MP4.
```python
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device="cpu")
aligned = whisperx.align(result["segments"], model_a, metadata, audio, device="cpu")

FONT_PATH = find_font()
out_path = generate_karaoke_video(
    video_file,
    music_path,
    output_voice_path,
    aligned,
    output_path="karaoke_final.mp4",
    show_transcript_subtitles=False
)
play_video(out_path)
```
This produces a final video with:
• Highlighted words synced to narration
• Optional sentence subtitles
• Music and voiceover merged
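The word-highlight overlay itself can be built with PIL and composited in MoviePy. A simplified sketch, assuming `aligned["word_segments"]` entries with `word`, `start`, and `end` keys (the structure WhisperX produces), the `FONT_PATH` found above, and an illustrative 1080×1920 portrait frame:
```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont
from moviepy.editor import ImageClip

def word_overlay(word, start, end, video_w=1080, video_h=1920):
    """Render one highlighted word onto a transparent strip, timed to its audio."""
    font = ImageFont.truetype(FONT_PATH, 72)
    img = Image.new("RGBA", (video_w, 120), (0, 0, 0, 0))
    draw = ImageDraw.Draw(img)
    x = (video_w - draw.textlength(word, font=font)) / 2  # center horizontally
    draw.text((x, 10), word, font=font, fill=(255, 215, 0, 255))  # gold highlight
    return (ImageClip(np.array(img), transparent=True)
            .set_start(start)
            .set_duration(end - start)
            .set_position(("center", video_h - 200)))

overlays = [word_overlay(w["word"], w["start"], w["end"])
            for w in aligned["word_segments"]]
```
Each overlay is then stacked onto the base clip with `CompositeVideoClip` before export.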
Workflow Visualization
```mermaid
flowchart TD
    A[Video Search: Pexels API] --> B[Scene Caption: Gemini AI]
    B --> C[Narration Script: Mistral/Gemini]
    C --> D[Voiceover: Edge-TTS/gTTS/pyttsx3]
    D --> E[WhisperX Alignment]
    E --> F[MoviePy Rendering]
    F --> G[Final Karaoke-Style Short]
```
Feature Comparison
| Factor | Manual Editing 🎬 | AI Shorts Generator 🤖 |
|---|---|---|
| Time per 30s video | 3–5 hours | 10–15 minutes |
| Tools needed | Premiere/AE | Colab + APIs |
| Cost | $100+/month | Free/Open Source |
| Technical skills | High | Beginner-friendly |
| Scalability | Low | High (batch-ready) |
| Captions | Manual | Auto-aligned karaoke |
| Personalization | Manual script | AI-driven tone/style |
Security Considerations
• API keys handled via getpass() in Colab → no hardcoding.
• .env management for reuse (see the sketch after this list).
• Limits: Pexels free tier (200 requests/hr), OpenRouter billing per token.
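For repeated runs, keys can be loaded from a `.env` file with `python-dotenv` instead of retyping them; the variable names here are illustrative:
```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads KEY=value pairs from a local .env file
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")
google_ai_studio_api_key = os.getenv("GOOGLE_AI_STUDIO_API_KEY")
pexels_api_key = os.getenv("PEXELS_API_KEY")
```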
Practical Use Cases
- Creators → Generate daily Shorts without burnout.
- Educators → Narrated micro-lessons with accessibility captions.
- Wellness apps → Meditation/affirmation clips at scale.
- Startups → Quick marketing creatives without agencies.
- Personal branding → Automate storytelling on LinkedIn/TikTok.
Future Roadmap
The current Colab pipeline is a proof of concept. Scaling it could mean:
• Custom fine-tuned narrators (brand voices).
• Emotion-aware music selection (AI matching tone).
• Multi-language support (WhisperX multilingual alignment).
• Real-time video generation APIs → SaaS platform.
• Drag-and-drop GUI → No-code app for non-tech creators.
Credits & Tools
• Gemini 1.5 by Google AI
• Mistral 7B via OpenRouter.ai
• WhisperX: Enhanced Whisper with word-level alignment
• MoviePy: Pythonic video editing
• PIL: Image drawing for subtitles
• Pexels API: Free stock videos
• TTS engines: gTTS, Edge-TTS, pyttsx3
• Music: Kevin MacLeod via incompetech.com
Conclusion
The AI Shorts Generator isn’t just a fun Colab notebook; it’s a prototype of media automation in action.
• It reduces hours → minutes.
• It merges vision, text, and sound seamlessly.
• It shows how developers can move from tinkering → to building full-scale AI content engines.
The next wave of media won’t be “edited.” It will be generated.
And projects like this are the bridge. Fork it. Test it. Extend it.
This is how you build your own AI-powered media pipeline in 2025.