This is a submission for the GitHub Copilot CLI Challenge
What I Built
AutoShorts is an AI-powered pipeline that automatically transforms long-form gameplay footage into viral-ready vertical clips. It uses Vision AI to semantically understand content (distinguishing between "action," "clutch plays," and "WTF moments"), then adds AI-generated captions and AI voiceovers with matching energy and personality.
The result? Hours of gameplay → polished TikTok/Shorts/Reels-ready clips, with minimal human intervention.
Demo
View Project on GitHub: Link
Demo Video:
🎥 Showcase: Multi-Language & Style Generation
AutoShorts automatically adapts its editing style, captions, and voiceover personality based on the content and target language. Here are some examples generated entirely by the pipeline:
| Content | Style | Language | Video |
|---|---|---|---|
| Fortnite | Story Roast | 🇺🇸 English | Watch Part 1 |
| Indiana Jones | GenZ Slang | 🇺🇸 English | Watch Part 1 |
| Battlefield 6 | Dramatic Story | 🇯🇵 Japanese | Watch Part 1 |
| Indiana Jones | Story News | 🇨🇳 Chinese | Watch Part 1 |
| Fortnite | Story Roast | 🇪🇸 Spanish | Watch Part 1 |
| Fortnite | Story Roast | 🇷🇺 Russian | Watch Part 1 |
| Indiana Jones | Auto Gameplay | 🇧🇷 Portuguese | Watch Part 1 |
📸 Dashboard Interface
1. Generate Page
The command center for creating new content. Simply drop a video or select an existing one, choose your analysis mode (Local vs. Cloud), and hit "Find Clips."
2. Settings & Cost Control
Full control over which AI models are used, with strict API cost management. You can toggle between OpenAI, Gemini, or the efficient Local Heuristics mode.
Why I Built This
I had a problem that every content creator knows: hours of gameplay footage, but no time to edit.
Recording gameplay is the easy part. The hard part is scrubbing through 2-hour VODs looking for that one clutch moment, that hilarious fail, or that "wait, what just happened?" clip. Then you need to:
- Find the moment
- Crop to vertical (9:16)
- Add captions that match the vibe
- Maybe add commentary or voiceover
- Export and repeat... dozens of times
I was spending 3-4 hours editing for every hour of footage. That's backwards.
I wanted a system where I could:
- Drop a raw gameplay file
- Walk away
- Come back to ready-to-upload clips with captions and voiceovers
AutoShorts is that system.
How I Built It (Technical Deep-Dive)
Building AutoShorts was a rollercoaster of "this is genius" moments immediately followed by "why is everything on fire." Here's the real story: the problems nobody warns you about, and the solutions that made it all work.
The Architecture Challenge
When the feature set started growing (Vision AI analysis, TTS voice synthesis, story narration, cross-clip narrative arcs), it became clear that a single orchestration file wasn't going to cut it. Every new feature touched everything else, and debugging felt like untangling Christmas lights.
The fix was Domain-Driven Design: splitting the logic into focused modules, each owning its piece of the pipeline:
src/
├── shorts.py             # Orchestration & rendering
├── ai_providers.py       # Gemini/OpenAI abstraction
├── tts_generator.py      # Qwen3-TTS voice synthesis
├── subtitle_generator.py # Caption generation & timing
└── story_narrator.py     # Cross-clip narrative generation
This separation seemed like overkill at first. Then I discovered I needed to load and unload AI models from GPU memory between pipeline stages (TTS has to yield VRAM for rendering, which has to yield for AI analysis), and suddenly having clean boundaries between modules was the only thing keeping me sane.
The VRAM Juggling Act
Here's the thing about running AI models on consumer GPUs: they don't share nicely.
Qwen3-TTS (voice synthesis) needs ~4GB VRAM. Video rendering with PyTorch needs ~2GB. These models don't politely step aside for each other; they sit in VRAM until you physically evict them.
The solution was aggressive model lifecycle management: singleton patterns with explicit cleanup:
# After TTS generation completes
QwenTTS.clear_instance()
torch.cuda.empty_cache()
gc.collect()
logging.info("TTS model unloaded; VRAM freed for rendering")
Without this, the pipeline would OOM (out-of-memory crash) after processing 2-3 clips. Fun times at 2 AM when you're wondering why clip #3 always segfaults.
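The eviction pattern above can be made reusable as a context manager, so every pipeline stage cleans up after itself even when it throws. This is a sketch under assumptions, not the actual AutoShorts code: `gpu_stage`, `load`, and `unload` are illustrative names standing in for the real singleton calls (e.g. wrapping `QwenTTS.clear_instance()`):

```python
import gc
import logging
from contextlib import contextmanager

@contextmanager
def gpu_stage(name, load, unload):
    """Load a model for one pipeline stage, then evict it from VRAM.

    `load`/`unload` are stage-specific callables (hypothetical names here),
    e.g. wrapping the TTS singleton's get/clear methods in a real pipeline.
    """
    model = load()
    try:
        yield model
    finally:
        unload()  # drop the stage's reference to the model
        try:
            import torch
            torch.cuda.empty_cache()  # return freed blocks to the CUDA driver
        except ImportError:
            pass  # CPU-only environment: nothing to evict
        gc.collect()  # collect any lingering Python-side references
        logging.info("%s unloaded, VRAM freed", name)
```

Usage is then `with gpu_stage("TTS", load_tts, unload_tts) as tts: ...`, which guarantees the unload runs before the next stage's load, crash or no crash.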
The Qwen3-VL Dead End: When "Local" Goes Too Far
I desperately wanted the entire video analysis to happen locally. I actually got Qwen3-VL (video-language model) integrated and working, but it was a textbook case of "just because you can, doesn't mean you should."
Qwen3-VL is a monster. It's not just big; it's VRAM-hungry beyond reason. My 12GB RTX 4080 laptop didn't stand a chance, and even on high-end 24GB cards, it would regularly hit the OOM wall during long video sequences.
I attempted a last-ditch effort using Qwen3-VL-4B-Instruct-FP8, but even with quantization, the stability wasn't there; it still occasionally nuked the pipeline. Worse, the analysis quality didn't justify the struggle: the results were underwhelming compared to the resource cost. It felt like I was trying to race a semi-truck on a go-kart track.
The pivot: This failure is actually what led to the Deep Analysis Proxy system. I realized that instead of fighting 30GB models locally, I could spend those dev cycles on intelligent preprocessing (the 15MB proxy) and let a cloud model do the heavy lifting for pennies. The result was a pipeline that's actually accessible to people with consumer GPUs, rather than just data center owners.
The TTS Timing Nightmare
This was the most infuriating bug I encountered, and it took three separate debugging sessions to crack.
The problem: Subtitles and voiceover were drifting out of sync in story mode. By the end of a 60-second clip, subtitles were 3-4 seconds ahead of the voice. Not great when you're going for "professional esports broadcast" and getting "badly dubbed foreign film."
The investigation:
Story mode generates a continuous narration (like a broadcaster). The TTS engine reads all sentences as one flowing piece. But subtitles were timed by probing each sentence individually:
Subtitle timing (probed separately):
"The player approaches" → 2.3s
"An incredible shot" → 1.8s
Total: 4.1s
TTS (generated as merged text):
"The player approaches an incredible shot" → 3.6s
See the problem? When you join sentences, the TTS naturally flows faster (no pause between them). That 0.5s error accumulated across every sentence.
The fix: Probe the merged narration once, then distribute timing proportionally:
# ❌ Wrong: probe each sentence separately
for sentence in sentences:
    duration = probe_tts(sentence)  # Accumulated error!

# ✅ Right: probe merged text, distribute proportionally
full_narration = " ".join(sentences)
total_duration = probe_tts(full_narration)
total_chars = sum(len(s) for s in sentences)
for sentence in sentences:
    sentence_duration = total_duration * (len(sentence) / total_chars)
One of those fixes where you stare at the solution and think "why didn't I see this three days ago?"
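The proportional fix above can be packaged as a small self-contained helper that also computes each subtitle's start/end offsets. This is a sketch of the idea (the function name and tuple shape are illustrative, not the project's actual API); it assumes spoken duration scales roughly linearly with character count:

```python
def distribute_timing(sentences, total_duration):
    """Split one probed TTS duration across sentences by character share.

    Returns (sentence, start, end) tuples whose spans tile [0, total_duration]
    with no gaps, so subtitles can't drift relative to the merged narration.
    """
    total_chars = sum(len(s) for s in sentences)
    timings, cursor = [], 0.0
    for sentence in sentences:
        duration = total_duration * (len(sentence) / total_chars)
        timings.append((sentence, round(cursor, 3), round(cursor + duration, 3)))
        cursor += duration  # next subtitle starts exactly where this one ends
    return timings
```

Because each start is the previous end, rounding errors can't accumulate into the multi-second drift described above.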
The "TTS Longer Than Video" Problem
Sometimes the AI writes an essay when you asked for a tweet. A 45-second gameplay clip ends up with 52 seconds of narration. Now what?
Three options on the table:
- Option A: Truncate the voiceover โ Loses content, sounds cut off
- Option B: Speed up the voice โ Sounds like a chipmunk reading the news
- Option C: Extend the video to match 🤔
Option C won, but with nuance:
if tts_duration > clip_duration + 1.5:
    # Big gap: go back to source video, extract more footage
    rerender_clip_for_tts(clip, render_meta, tts_duration + 1.0)
else:
    # Small gap: freeze last frame using FFmpeg tpad
    ffmpeg_filter = f"tpad=stop_mode=clone:stop_duration={gap}s"
The re-render logic reaches back into the original source video and extracts more footage, even beyond the original scene boundaries. This required tracking render metadata (start time, source file, scene duration) through the entire pipeline. Worth it, though: no more cut-off narration.
FlashAttention: When Your RAM Isn't Enough
Qwen3-TTS performs best with FlashAttention 2, a CUDA kernel that speeds up attention computation by 3-4x. One problem: building it from source requires compiling CUDA code, which needs 125GB+ RAM during compilation. On machines with less than 32GB RAM, the build takes 24 hours or more, if the OOM killer doesn't murder it first.
My machine has 16GB. Killed: my favorite one-word error message.
The solution? Prebuilt wheels. Someone lovely had already compiled FlashAttention for various PyTorch + CUDA combinations:
install_flash_attn:
	@PYVER=$$(python -c "import sys; print(f'cp{sys.version_info.major}{sys.version_info.minor}')"); \
	pip install https://github.com/.../flash_attn-2.6.3+cu128torch2.10-$$PYVER-linux_x86_64.whl
One line. No compilation. No 125GB RAM requirement. Installation went from "impossible on my hardware" to "done in 30 seconds."
Deep Analysis: Letting AI See the Full Picture
Here's an insight that changed everything: short clips lack context.
In the default mode, each candidate clip is analyzed independently: the AI sees 2 minutes of footage and scores it, but it doesn't know what happened before or after. A celebration makes no sense without the clutch play that preceded it.
Deep Analysis mode fixes this by letting Gemini see the entire video. But we're not about to upload a multi-GB 4K recording raw; that would take forever and burn through API quotas.
Instead, we generate a lightweight proxy first using GPU-accelerated FFmpeg:
# GPU-accelerated proxy: 4K@60fps → 640p@1fps, high compression
gpu_cmd = [
    "ffmpeg", "-y",
    "-hwaccel", "cuda",
    "-hwaccel_output_format", "cuda",
    "-i", str(video_path),
    "-vf", "scale_cuda=640:-2,fps=1",  # 640px wide, 1 frame per second
    "-c:v", "hevc_nvenc",
    "-qp", "35",                       # Aggressive compression
    "-c:a", "aac", "-b:a", "32k", "-ac", "1",  # Mono 32kbps audio
    str(temp_proxy),
]
A 2-hour 4K gameplay recording (~30GB) becomes a ~15MB proxy. Same content, same timeline, same audio cues, just tiny enough to upload in seconds. The proxy is also cached by file hash, so re-runs skip the generation step entirely.
The AI can now identify narrative arcs: the setup, the payoff, the aftermath. It finds moments that a clip-by-clip analysis would miss entirely. The quality jump is dramatic, and all it costs is a ~15MB upload instead of 30GB.
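The hash-keyed cache mentioned above could look something like this. A sketch only: the function name, cache directory, and filename scheme are illustrative, and the real pipeline may key the cache differently. Hashing in chunks matters here, since a 30GB recording must never be read into RAM in one go:

```python
import hashlib
from pathlib import Path

def proxy_cache_path(video_path, cache_dir="proxy_cache", chunk_mb=4):
    """Derive a stable cache filename for a video's analysis proxy.

    Streams the file through SHA-256 in 4MB chunks, then names the proxy
    after the hash, so re-runs on an unchanged source hit the same path.
    """
    h = hashlib.sha256()
    with open(video_path, "rb") as f:
        while chunk := f.read(chunk_mb * 1024 * 1024):
            h.update(chunk)
    cache = Path(cache_dir)
    cache.mkdir(exist_ok=True)
    return cache / f"{h.hexdigest()[:16]}_proxy.mp4"
```

The caller then checks `proxy_cache_path(video).exists()` before launching FFmpeg; hashing the full file is itself slow for huge recordings, so a production version might hash only the file size plus the first and last chunks as a cheaper fingerprint.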
Voice Design: From Text to Personality
The most "wow" feature. Instead of picking from generic preset voices, you describe the voice you want in natural language:
VOICE_PRESET_MAP = {
    "story_news": """
        gender: Male.
        pitch: Dynamic, high-energy with excitement.
        speed: Brisk, fast-paced, maintaining high momentum.
        emotion: Hype, adrenaline, "unbelievable play" excitement.
        personality: Charismatic, knowledgeable, maximum energy.
    """,
    "story_dramatic": """
        gender: Female.
        pitch: Rich, resonant mid-range with expressive depth.
        speed: Measured, deliberate pacing with dramatic pauses for impact.
        emotion: Intense, evocative, drawing listeners into the story.
        personality: Wise, commanding, magnetic storyteller presence.
    """,
}
Qwen3-TTS reads this description and synthesizes a matching voice. The same caption sounds completely different between "esports broadcaster" and "creepypasta narrator", and it all happens locally. No cloud TTS API, no per-word billing.
Slang Preprocessing: Making TTS Sound Natural
TTS engines and internet slang do not get along. "rn" becomes "urn." "lol" becomes "loll." "fr fr" sounds like a French car brand.
The fix is a preprocessing layer that expands slang before TTS sees it:
def preprocess_tts_text(text):
    t = text
    t = re.sub(r'\brn\b', 'right now', t, flags=re.IGNORECASE)
    t = re.sub(r'\blol\b', 'L O L', t, flags=re.IGNORECASE)
    t = re.sub(r'\bidk\b', "I don't know", t, flags=re.IGNORECASE)
    # Qwen3-TTS doesn't pause at dashes, so swap them for ellipses
    t = t.replace(" -- ", "... ")
    t = t.replace(" - ", "... ")
    return t
Small detail, huge impact. GenZ-style captions like "bro that was lowkey insane rn fr fr" actually sound right when spoken aloud.
CJK Subtitle Handling: When Words Don't Have Spaces
English subtitles are easy: split on spaces, chunk into 7-word captions, done. But Japanese, Chinese, and Korean (CJK languages) don't use spaces between words. A sentence is one continuous stream of characters.
This completely broke the subtitle chunking logic. A 40-character Japanese sentence would appear as one massive wall of text filling the entire screen.
The fix was character-based splitting with language detection:
# Detect CJK characters in the sentence
is_cjk = any("\u4e00" <= char <= "\u9fff" or  # Chinese
             "\u3040" <= char <= "\u30ff"     # Japanese
             for char in sentence)

MAX_CJK_CHARS = 18  # Characters per line for CJK

if is_cjk:
    # Character-based splitting instead of word-based
    chunks = [sentence[i:i + MAX_CJK_CHARS]
              for i in range(0, len(sentence), MAX_CJK_CHARS)]

    # Distribute TTS duration proportionally by character count
    for chunk in chunks:
        chunk_ratio = len(chunk) / len(sentence)
        chunk_duration = tts_duration * chunk_ratio
The sentence splitter also handles CJK punctuation (。！？), which doesn't follow the English pattern of period-then-whitespace. These characters terminate sentences directly, no space required.
One of those "obvious in hindsight" fixes that makes multi-language support actually work instead of just being a checkbox on a feature list.
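Pulling the detection, splitting, and timing logic above into one self-contained function makes it easy to test. This is a sketch, not the project's actual `subtitle_generator.py` API; the function name and return shape are illustrative:

```python
def chunk_cjk_sentence(sentence, tts_duration, max_chars=18):
    """Split a spaceless CJK sentence into caption chunks with durations.

    Non-CJK text is returned untouched (the word-based path handles it);
    CJK text is cut every `max_chars` characters, and the TTS duration is
    distributed by character share, mirroring the timing fix above.
    """
    def is_cjk_char(c):
        # CJK Unified Ideographs, plus hiragana/katakana
        return "\u4e00" <= c <= "\u9fff" or "\u3040" <= c <= "\u30ff"

    if not any(is_cjk_char(c) for c in sentence):
        return [(sentence, tts_duration)]

    chunks = [sentence[i:i + max_chars]
              for i in range(0, len(sentence), max_chars)]
    return [(c, tts_duration * len(c) / len(sentence)) for c in chunks]
```

Note the ranges here cover only the scripts the original snippet checks; a fuller implementation would also include Hangul blocks for Korean.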
My Experience with GitHub Copilot CLI
Everything above? That's the engineering. But I'd be lying if I said I did it alone. GitHub Copilot CLI was my pair programmer through most of this, and here's how it actually helped.
Copilot CLI wasn't just autocomplete: it was a debugging partner, architecture consultant, and documentation writer rolled into one.
What Worked Exceptionally Well
1. Plan Mode for Complex Changes
Using [[PLAN]] prefix before major refactors gave me a structured approach:
[[PLAN]] Migrate from ChatterBox TTS to Qwen3-TTS VoiceDesign
Copilot generated a 6-phase plan covering dependency changes, API migration, FlashAttention setup, testing checkpoints, and rollback strategies. I could review and edit the plan before implementation started.
2. Debugging Across Sessions
The checkpoint system was crucial. When investigating the subtitle timing bug, I could reference earlier sessions:
"Check checkpoint 012-tts-subtitle-sync for what we tried before"
Copilot would review the history and avoid repeating failed approaches.
3. Parallel Exploration
When I wasn't sure which approach to take, I'd ask Copilot to spin up explore agents to investigate multiple paths simultaneously:
task agent_type: explore
prompt: "How does generate_for_captions() handle timing in story mode vs normal mode?"
This let me understand the codebase faster than reading linearly.
4. Test Generation
After making changes, Copilot helped write comprehensive tests:
def test_preprocess_tts_text_em_dash():
    result = preprocess_tts_text("wait — what")
    assert "..." in result
    assert "—" not in result
50 tests covering subtitle formatting, TTS preprocessing, voice description generation, and scene combination logic, all generated from understanding the code context.
What I Learned
- Be specific about constraints. "Fix the OOM error" is less useful than "We have 10GB VRAM, model A needs 8GB, model B needs 4GB, how do we sequence them?"
- Use checkpoints liberally. Complex debugging spans sessions. Good checkpoints save hours.
- Let Copilot see the errors. Pasting full stack traces and logs gives it the context to diagnose accurately.
- Trust but verify. Copilot's suggestions are usually good, but always run the tests.
The Pipeline Today
Here's what happens when you drop a gameplay video into AutoShorts:
- Scene Detection → GPU-accelerated analysis finds candidate moments using audio spikes + motion detection
- AI Ranking → Vision AI (Gemini/OpenAI) watches each clip and scores it across 7 semantic categories
- Deep Analysis (optional) → GPU-downscaled proxy uploaded to Gemini for context-aware moment detection
- Smart Selection → Diverse category selection ensures variety (not just all "action" clips)
- GPU Rendering → NVENC hardware encoding creates vertical crops with blurred backgrounds
- Caption Generation → AI writes contextual captions matching the clip's energy
- Voice Synthesis → Qwen3-TTS creates matching voiceovers with style-appropriate personalities
- Timing Sync → Subtitle timing synchronized with actual TTS audio duration
- Smart Mixing → Game audio ducked during voiceover, video extended if TTS runs long
Total processing time: ~5-7 minutes per clip on an RTX 3080.
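The nine stages above amount to threading one context object through a fixed sequence, which can be pictured as a simple fold. Purely illustrative: the real orchestration lives in `shorts.py`, and these stage functions and dict keys are stand-ins, not the actual code:

```python
def run_pipeline(video_path, stages):
    """Run pipeline stages in order, threading a context dict through.

    Each stage receives the accumulated context (video path, candidate
    clips, scores, ...) and returns an enriched copy.
    """
    ctx = {"video": video_path, "clips": [], "log": []}
    for name, stage in stages:
        ctx = stage(ctx)
        ctx["log"].append(name)  # record stage order for debugging
    return ctx

# Toy stand-ins for the first two stages, just to show the data flow
STAGES = [
    ("scene_detection", lambda ctx: {**ctx, "clips": ["clip_01", "clip_02"]}),
    ("ai_ranking", lambda ctx: {**ctx, "scores": {c: 1.0 for c in ctx["clips"]}}),
]
```

The payoff of this shape is that VRAM-hungry stages (TTS, rendering, analysis) get a natural boundary at which to load and unload their models.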
Analysis Modes & Cost
AutoShorts supports four analysis modes, each with different tradeoffs between cost, accuracy, and speed. You choose the mode via environment variables โ no code changes needed.
How Each Mode Works
🔧 Local Heuristics Only (AI_PROVIDER=local)
Zero API calls. Scenes are scored purely on GPU-computed signals:
- Audio RMS โ Loudness spikes (explosions, crowd reactions, voice peaks).
- Spectral Flux โ Sudden frequency changes (gunshots, impacts, glass breaking).
- Visual Motion โ Pixel-diff action scoring via GPU-accelerated grayscale diffing.
All three signals are computed in a single pass using PyTorch on GPU. Scenes are ranked by a combined 0.6 ร Audio (RMS + Flux) + 0.4 ร Visual Motion score. Fast, free, and surprisingly effective for high-action content โ but blind to context (it can't tell a celebration from a firefight).
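The weighted ranking can be sketched in a few lines. This is an illustration of the 0.6/0.4 formula above, not the actual GPU code: the dict field names are made up, how RMS and Spectral Flux merge into the single audio term (a plain average here) is my assumption, and the real pipeline computes these signals with PyTorch on GPU:

```python
def rank_scenes(scenes, audio_weight=0.6, motion_weight=0.4):
    """Rank scenes by the combined heuristic score described above.

    Each scene dict carries pre-computed signals normalized to [0, 1]
    (field names are illustrative, not the pipeline's actual schema).
    """
    def score(s):
        audio = (s["rms"] + s["flux"]) / 2  # assumed: average the two audio signals
        return audio_weight * audio + motion_weight * s["motion"]
    return sorted(scenes, key=score, reverse=True)
```

A loud, fast-moving firefight outranks a quiet menu screen, but, as noted, the score is context-blind: a post-match celebration with loud audio and confetti motion scores just as well.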
🖼️ OpenAI Vision (AI_PROVIDER=openai)
Heuristics first narrow the field using Smart Selection (70% top scores + 30% random exploration), then candidates are sent to OpenAI. OpenAI's API doesn't accept video, so we extract 8 keyframe JPEGs per clip:
# Extract 8 static frames as base64 JPEGs
cmd = ["ffmpeg", "-i", clip_path, "-vf", "fps=1", "-frames:v", "8", ...]
# Send as image_url content to GPT-4o
content.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{frame}"}})
The AI scores each clip across 7 semantic categories (action, funny, clutch, wtf, epic_fail, hype, skill). Good accuracy from static frames alone, but it can't hear audio and misses motion-dependent moments like glitches or physics bugs.
🎬 Gemini Per-Clip (AI_PROVIDER=gemini)
Uses the same Smart Selection (mixing high-heuristic clips with random segments for diversity), but uploads each candidate as actual video (downscaled to 640px wide). Gemini sees motion, timing, and audio:
# Each candidate clip: 640p downscaled, ~30-60s, uploaded as MP4
video_file = client.files.upload(file=clip_data, config={"mime_type": "video/mp4"})
response = client.models.generate_content(model="gemini-3-flash", contents=[video_file, prompt])
Significantly better at detecting funny, wtf, and clutch moments that depend on temporal context. Clips are analyzed in parallel (3 concurrent threads) to keep latency manageable.
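The "3 concurrent threads" detail above maps naturally onto a thread pool, since per-clip analysis is network-bound (upload plus API latency), not CPU-bound. A sketch under assumptions: `analyze_fn` is a hypothetical stand-in for the real per-clip upload-and-score call, not AutoShorts' actual function:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_clips_parallel(clips, analyze_fn, max_workers=3):
    """Score candidate clips with a small pool of concurrent workers.

    Threads (not processes) fit here because each task mostly waits on
    network I/O; max_workers=3 caps concurrent uploads, matching the
    parallelism described above.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so scores line up with clips
        return list(pool.map(analyze_fn, clips))
```

Keeping the pool small also respects API rate limits; `pool.map` preserving input order means results can be zipped straight back onto the candidate list.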
🧠 Gemini Deep Analysis (GEMINI_DEEP_ANALYSIS=true)
The nuclear option. Instead of pre-filtering with heuristics then analyzing clips, Deep Analysis lets Gemini see the entire video, but not the raw multi-GB 4K file. A GPU-accelerated proxy is generated first:
4K @ 60fps → 640p @ 1fps, QP 35, mono 32kbps audio
~30GB gameplay recording → ~15MB proxy
Gemini watches the whole thing and returns timestamped moments with categories and scores. No heuristic bias, no missed context. The AI finds narrative arcs (the buildup before a clutch play, the reaction after an epic fail) that clip-by-clip analysis simply can't detect.
Deep Analysis moments are scored with a +200 bias to ensure they rank above any heuristic candidate. A few high-action heuristic backups are still included as safety net clips.
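The +200 bias plus safety-net merge described above can be sketched as follows. Illustrative only: the dict shapes, function name, and backup count are assumptions, not the project's actual data model; the fixed bias is the mechanism the text describes:

```python
DEEP_ANALYSIS_BIAS = 200  # deep moments always outrank heuristic candidates

def merge_candidates(deep_moments, heuristic_scenes, backup_count=3):
    """Combine Deep Analysis moments with heuristic safety-net clips.

    Deep moments get a flat score bias so every one of them ranks above
    any heuristic candidate; a few top heuristic scenes are kept as
    backups in case deep analysis found too few moments.
    """
    biased = [{**m, "score": m["score"] + DEEP_ANALYSIS_BIAS, "source": "deep"}
              for m in deep_moments]
    backups = [{**s, "source": "heuristic"}
               for s in sorted(heuristic_scenes,
                               key=lambda s: s["score"],
                               reverse=True)[:backup_count]]
    return sorted(biased + backups, key=lambda c: c["score"], reverse=True)
```

A flat additive bias is simpler and more predictable than rescaling the two score distributions against each other; as long as heuristic scores stay below 200, the ordering guarantee holds.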
Comparison Summary (1-hour 4K gameplay)
| Mode | Accuracy | Analysis Cost | Creative Cost* | Total Cost | Data Uploaded |
|---|---|---|---|---|---|
| Local Heuristics | ⭐⭐ | Free | Free (Whisper) | Free | 0 bytes |
| OpenAI Vision | ⭐⭐⭐ | ~\$0.15 | ~\$0.15 | ~\$0.30 | ~6MB |
| Gemini Per-Clip | ⭐⭐⭐⭐ | ~\$0.08 | ~\$0.08 | ~\$0.16 | ~90MB |
| Gemini Deep Analysis | ⭐⭐⭐⭐⭐ | ~\$0.05 | ~\$0.08 | ~\$0.13 | ~60MB |
*Creative Cost: AI caption generation (one LLM API call) plus voiceover, which is synthesized locally for free.
The counterintuitive result: Deep Analysis is the most cost-effective mode because it replaces 15 individual analysis uploads with one optimized proxy upload, while still delivering superior context-aware detection.
Roadmap & Vision
AutoShorts works today as a local pipeline for content creators. But the underlying engine (scene detection, AI ranking, voice synthesis, smart cropping) is a general-purpose highlight extraction backend. Here's where this is heading:
🔮 What's Next
| Phase | Feature | Status |
|---|---|---|
| v2.1 | Universal Video Type Support (Podcasts, Sports, Entertainment, etc.) | Planned |
| v2.2 | SFX generation: AI-generated sound effects matched to on-screen action | Planned |
| v2.3 | Cloud API mode (submit video URL → get clips back) | Designing |
| v3.0 | Live stream monitoring (detect highlights in real-time) | Research |
| v3.x | Multi-platform auto-upload (TikTok, YouTube Shorts, Reels) | Backlog |
🎮 Platform Integration Potential
The most exciting future isn't AutoShorts as a standalone tool; it's AutoShorts as a backend engine embedded in platforms millions of gamers already use:
- Microsoft Xbox Game Bar: The overlay already captures screenshots and gameplay recordings (Win+G). Imagine a "Generate Highlights" button that takes your captured footage and produces ready-to-share clips with captions and voiceover, without ever leaving the overlay.
- NVIDIA ShadowPlay: ShadowPlay's Instant Replay already silently records the last 30 seconds to 20 minutes of gameplay. Pair that buffer with AutoShorts' AI ranking, and ShadowPlay could automatically identify and export your best moments with professional-grade overlays and narration. No scrubbing through footage. No editing. Just play.
- Discord Activity Integration: Post-session highlight reels generated from screen shares, dropped directly into your server channel.
The core thesis: highlight detection + voice synthesis + smart cropping is infrastructure, not an app. Every platform that captures gameplay footage could use this engine to turn passive recording into active content creation.
The best highlight reel is the one you never had to make.
Acknowledgements
This project builds upon:
- artryazanov/shorts-maker-gpu: GPU-accelerated clip extraction using heuristic scoring (audio dB + motion detection).
- Binary-Bytes/Auto-YouTube-Shorts-Maker: Original concept and inspiration for the automated short-form content pipeline.
- Qwen3-TTS: Voice synthesis with natural-language voice design.
- PyCaps: Animated subtitle rendering.
Key Improvements Over Base Project
| Feature | Base Project | AutoShorts |
|---|---|---|
| Architecture | Monolithic script | Modular package with lifecycle management |
| Scene Scoring | Audio dB + motion only | Hybrid: heuristics + Vision AI semantic analysis |
| Deep Analysis | N/A | Full-video Gemini analysis for context-aware detection |
| Voiceover | None | Qwen3-TTS with style-adaptive voice design |
| Captions | None | AI-generated, 10+ styles including story modes |
| CJK Support | N/A | Character-based subtitle chunking for CJK languages |
| Memory | Single model | VRAM-aware model sequencing (unload between phases) |
| TTS Sync | N/A | Per-sentence TTS generation for accurate timing |
| Overflow Handling | N/A | Re-render clips when TTS > video length |
Try It Yourself
# Clone the repository
git clone https://github.com/divyaprakash0426/autoshorts.git
cd autoshorts
# Setup environment variables
cp .env.example .env
# Edit .env and add your API keys (Gemini/OpenAI)
# Option 1: Using Makefile (Recommended)
make install
# Option 2: Using Shell Script
./install.sh
# Drop videos in gameplay/, then run:
./.venv/bin/python run.py
Or launch the dashboard:
./.venv/bin/streamlit run src/dashboard/About.py
🛡️ Battle Tested On
Asus Zephyrus G16 (RTX 4080 Mobile, Intel Ultra 9) running Arch Linux.
Built with frustration, caffeine, and GitHub Copilot CLI.