divyaprakash D
Building AutoShorts: A High-Performance AI Pipeline for Automated Viral Content 🎬🤖

The Problem: Content Creation is a Bottleneck

Every creator knows the "highlight reel" struggle. You have hours of high-quality gameplay footage, but finding that perfect 30-second clip, cropping it, adding subtitles, and layering a voiceover takes hours of manual labor.
I wanted to see if I could build a fully automated, high-performance pipeline to handle this from start to finish. Today, I'm open-sourcing AutoShorts.
AutoShorts architecture diagram

What is AutoShorts?

AutoShorts is a GPU-optimized CLI tool that analyzes long-form video, identifies high-engagement scenes using AI, and synthesizes them into ready-to-upload vertical shorts.
It doesn't just "cut" video; it understands it.

The Technical Deep Dive 🛠️

To keep processing times low and avoid massive cloud API bills, I focused heavily on local processing and hardware acceleration:

1. GPU Scene Analysis ⚡

Using decord and PyTorch, the pipeline performs frame extraction and visual feature analysis directly on the GPU. We calculate action density and spectral flux to find "loud" or "fast" moments before the text-based AI even sees the clip.
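The two signals can be sketched in a few lines. This is a minimal CPU/NumPy illustration of the math, not AutoShorts' actual implementation (the real pipeline decodes frames with decord and runs on the GPU via PyTorch); the function names are mine:

```python
import numpy as np

def action_density(frames: np.ndarray) -> np.ndarray:
    """Mean absolute pixel change between consecutive frames.

    frames: (T, H, W) grayscale array, float in [0, 1].
    Returns a (T-1,) motion score, one per frame transition.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return diffs.mean(axis=(1, 2))

def spectral_flux(audio: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Sum of positive magnitude-spectrum changes between audio frames.

    Spikes when the soundtrack suddenly gets louder or busier.
    """
    n = 1 + (len(audio) - frame_len) // hop
    mags = np.stack([
        np.abs(np.fft.rfft(audio[i * hop:i * hop + frame_len]))
        for i in range(n)
    ])
    flux = np.diff(mags, axis=0)
    return np.clip(flux, 0, None).sum(axis=1)

# A static clip scores near zero; a sudden scene change spikes.
still = np.zeros((10, 8, 8), dtype=np.float32)
moving = still.copy()
moving[5:] = 1.0  # hard cut at frame 5
print(action_density(still).max(), action_density(moving).max())
```

Segments where both curves spike are the "loud or fast" candidates that get forwarded to the LLM stage.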

2. Dual-AI Intelligence 🧠

The pipeline integrates with OpenAI (GPT-4o) and Google Gemini. We pass the metadata and scene descriptions to the LLM to score segments based on:
- Hook Potential: Is the start grabby?
- Relevance: Does the action make sense?
- Emotional Impact: Is it funny, impressive, or a "fail"?
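To make the idea concrete, here is a hedged sketch of what such a scoring round-trip can look like. The prompt template, weights, and `parse_scores` helper are illustrative (not AutoShorts' actual prompt), and the model reply is canned here instead of calling GPT-4o or Gemini:

```python
import json

SCORING_PROMPT = """You are rating a video segment for short-form virality.
Scene description: {description}
Duration: {duration:.1f}s, motion score: {motion:.2f}

Return JSON: {{"hook": 0-10, "relevance": 0-10, "emotion": 0-10}}"""

def build_prompt(description: str, duration: float, motion: float) -> str:
    return SCORING_PROMPT.format(description=description, duration=duration, motion=motion)

def parse_scores(reply: str) -> float:
    """Parse the model's JSON reply and combine it into one ranking score."""
    scores = json.loads(reply)
    # Weight the hook highest: the first second decides the swipe.
    return 0.5 * scores["hook"] + 0.2 * scores["relevance"] + 0.3 * scores["emotion"]

# Canned reply standing in for the real LLM call:
reply = '{"hook": 8, "relevance": 6, "emotion": 9}'
print(round(parse_scores(reply), 2))
```

Segments are then ranked by this combined score and the top ones move on to rendering.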

3. Smart Subtitles & Neural TTS 🗣️

- Local TTS: Instead of paid APIs, we use ChatterBox locally. It supports emotional prosody, so the voiceover doesn't sound like a monotone robot.
- PyCaps Renderer: We use a custom Playwright-based renderer to create those "MrBeast style" word-by-word animated captions that are essential for mobile retention.
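Word-by-word captions boil down to assigning each word its own on/off timestamps within the subtitle segment. A minimal sketch of that timing step (an illustrative helper, not PyCaps' actual API), weighting each word's duration by its length:

```python
def word_timings(text: str, start: float, end: float) -> list[tuple[str, float, float]]:
    """Spread word reveal times across [start, end], weighted by word length.

    Returns (word, t_on, t_off) tuples a renderer can animate one at a time.
    """
    words = text.split()
    total = sum(len(w) for w in words)
    timings, t = [], start
    for w in words:
        dur = (end - start) * len(w) / total
        timings.append((w, round(t, 3), round(t + dur, 3)))
        t += dur
    return timings

for w, t_on, t_off in word_timings("THIS CLIP IS INSANE", 0.0, 2.0):
    print(f"{t_on:>5.2f}s  {w}")
```

In the real pipeline the timestamps come from the transcription step rather than length-weighting, but the renderer consumes the same shape of data.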

4. NVENC Rendering 🎞️

Final assembly—including audio mixing, blurring backgrounds (for the vertical look), and burning in subtitles—is offloaded to NVIDIA’s NVENC hardware. This keeps the CPU free for other tasks and slashes render times.
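The whole assembly can be expressed as a single ffmpeg invocation. This is a simplified sketch of the idea (blurred-background vertical canvas, burned-in subtitles, `h264_nvenc` hardware encode); the exact filter graph and encoder settings in AutoShorts may differ:

```python
def nvenc_command(src: str, subs: str, out: str,
                  width: int = 1080, height: int = 1920) -> list[str]:
    """Build an ffmpeg command for the vertical-short render pass."""
    vf = (
        f"split[bg][fg];"
        f"[bg]scale={width}:{height},boxblur=20[blurred];"   # blurred 9:16 backdrop
        f"[blurred][fg]overlay=(W-w)/2:(H-h)/2,"             # center the source clip
        f"subtitles={subs}"                                  # burn in the captions
    )
    return [
        "ffmpeg", "-y", "-i", src,
        "-filter_complex", vf,
        "-c:v", "h264_nvenc", "-preset", "p5", "-b:v", "8M",  # GPU encode
        "-c:a", "aac",
        out,
    ]

print(" ".join(nvenc_command("clip.mp4", "subs.ass", "short.mp4")))
```

Because the filters run before the encoder, only the final encode hits NVENC; everything else stays in ffmpeg's filter graph, and the CPU is left mostly idle.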

🚧 What’s Next? (The Roadmap)

This is a v1.0 release, and while the pipeline is robust, the potential for enhancement is huge. I’m looking for contributors to help with:
- Upgrading the Voice Engine: Integrating more recent open-source models like ChatterBoxTurbo, Qwen-TTS, or NVIDIA's latest TTS for even more realistic voice cloning and prosody.
- Intelligent Auto-Zoom: Currently, the 9:16 crop is centered. Adding object detection (YOLO/RT-DETR) to follow the action—dynamically moving the crop window to follow a character or a vehicle.
- Advanced Transition Styles: Adding AI-generated transitions between merged scenes.
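For the auto-zoom idea, the geometry is simple even before any detector is wired in: given a subject center from a detection box, compute a 9:16 window and clamp it to the frame. A hypothetical sketch of that clamping math (this feature doesn't exist yet; the function is mine):

```python
def crop_window(frame_w: int, frame_h: int, cx: float) -> tuple[int, int, int, int]:
    """Compute a 9:16 crop centered on x=cx, clamped to the frame bounds.

    cx would be the detected subject's center (e.g. from a YOLO box);
    returns (x, y, w, h) suitable for a crop filter.
    """
    h = frame_h
    w = int(h * 9 / 16)
    x = int(cx - w / 2)
    x = max(0, min(x, frame_w - w))  # keep the window inside the frame
    return x, 0, w, h

# 1920x1080 source, subject near the left edge:
print(crop_window(1920, 1080, cx=100))  # window clamps to x=0
```

Smoothing `cx` over time (e.g. an exponential moving average) would be needed in practice so the crop doesn't jitter with every detection.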

Build With Me 🚀

The project is fully dockerized and open for contributions. Whether you're interested in machine learning, computer vision, or just want to automate your own YouTube channel, I'd love to see you in the PRs.
GitHub Repository: github.com/divyaprakash0426/autoshorts
A huge thanks to artryazanov and Binary-Bytes, whose original concepts provided the foundation for this refactored release.
What features would you add to an AI video pipeline like this? Let's discuss in the comments! 👇