Every time I watch a stream, I think the same thing: there are 10 incredible moments buried in 3 hours of footage that no one will ever see.
Editing is the bottleneck. It takes skill, time, and software most people don't want to learn. So the best gaming moments just... disappear.
I built clipforge to fix that. Upload a gameplay video. Get three ready-to-post formats back — automatically.
## What it produces
From a single upload (mp4, mov, or mkv — up to 10 minutes):
- TikTok/Reels clip — 60 seconds max, vertical 9:16 crop, auto-captions via Whisper
- YouTube highlight reel — top moments sequenced, up to 10 minutes
- Cinematic trailer — 90 seconds, fast cuts + slow-mo climax on the best moment
All three come back as a single ZIP download.
## How the pipeline works
```text
Upload video
      ↓
Scene detection (PySceneDetect)
      ↓
Highlight scoring (librosa RMS energy)
      ↓
Clip selection (top N by score)
      ↓
Format assembly (moviepy)
  ├── TikTok: best moment, vertical crop, captions
  ├── YouTube: top moments concatenated
  └── Trailer: fast cuts + slow-mo climax
      ↓
ZIP download
```
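The stages above map naturally onto a small orchestrator. A minimal sketch, with each stage stubbed out; the function names and return shapes here are illustrative, not clipforge's actual internals:

```python
from pathlib import Path

# Illustrative stubs -- the real stages wrap PySceneDetect, librosa, and moviepy.
def detect_scenes(video: Path) -> list[tuple[float, float]]:
    """Pretend scene boundaries as (start, end) in seconds."""
    return [(0.0, 12.5), (12.5, 40.0), (40.0, 55.0)]

def score_scene(video: Path, scene: tuple[float, float]) -> float:
    """Stand-in for RMS-energy scoring; here just the scene length."""
    return scene[1] - scene[0]

def run_pipeline(video: Path, top_n: int = 10) -> list[tuple[float, float]]:
    """Detect, score, and return the top-N scenes for the assembler."""
    scenes = detect_scenes(video)
    ranked = sorted(scenes, key=lambda s: score_scene(video, s), reverse=True)
    return ranked[:top_n]

print(run_pipeline(Path("clip.mp4"), top_n=2))  # [(12.5, 40.0), (40.0, 55.0)]
```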
### Scene detection
PySceneDetect finds boundaries where the video content changes significantly — cut to a new location, a killcam, a respawn screen. Each boundary becomes a candidate scene.
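Under the hood, content-aware detection boils down to comparing consecutive frames and flagging big jumps. A toy numpy illustration of that intuition; this is not PySceneDetect's implementation, and the threshold is made up:

```python
import numpy as np

def toy_scene_cuts(frames: np.ndarray, threshold: float = 30.0) -> list[int]:
    """Return frame indices where mean absolute pixel change exceeds threshold."""
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0)).mean(axis=(1, 2))
    return [i + 1 for i, d in enumerate(diffs) if d > threshold]

# Three flat "frames": two dark, then a hard cut to bright.
frames = np.stack([
    np.full((4, 4), 10),
    np.full((4, 4), 12),
    np.full((4, 4), 200),
])
print(toy_scene_cuts(frames))  # [2]
```

The real library layers a lot on top of this (fade handling, flash suppression, minimum scene length), which is exactly why it's worth using instead of rolling your own.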
### Highlight scoring
For each scene, I extract the audio with librosa and compute the mean RMS energy. This is a simple but effective proxy for excitement: explosions, clutch moments, commentary peaks all produce louder audio than menu screens or downtime.
```python
import librosa
import numpy as np

def _rms_score(y, sr, start, end):
    """Mean RMS energy of the audio between start and end (seconds)."""
    segment = y[int(start * sr):int(end * sr)]
    rms = librosa.feature.rms(y=segment, frame_length=2048, hop_length=512)
    return float(np.mean(rms))
```
Scenes are ranked by score. The top 10 go to the assembler.
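The ranking is easy to sanity-check with synthetic audio: a loud burst should outrank near-silence. A dependency-free sketch using a plain-numpy RMS in place of librosa's (the scene boundaries here are invented):

```python
import numpy as np

def rms_score(y: np.ndarray, sr: int, start: float, end: float) -> float:
    """Plain-numpy stand-in for librosa.feature.rms: root-mean-square energy."""
    segment = y[int(start * sr):int(end * sr)]
    return float(np.sqrt(np.mean(segment ** 2)))

sr = 22050
quiet = 0.01 * np.random.default_rng(0).standard_normal(sr)   # downtime
loud = 0.8 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)     # explosion-ish
y = np.concatenate([quiet, loud])

scenes = {"menu": (0.0, 1.0), "fight": (1.0, 2.0)}
ranked = sorted(scenes, key=lambda k: rms_score(y, sr, *scenes[k]), reverse=True)
print(ranked)  # ['fight', 'menu']
```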
### Format assembly
moviepy handles the actual video cutting. The TikTok path crops to 9:16 and adds caption overlays. The trailer path applies 0.5x slow-mo to the highest-scoring moment and stacks fast cuts before it.
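The 9:16 crop itself is just arithmetic: keep the full height and take a centered slice of the width. A small helper that computes the crop box (the function name is mine, not clipforge's):

```python
def vertical_crop_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Centered 9:16 crop box as (x1, y1, x2, y2), keeping full height."""
    target_w = round(height * 9 / 16)
    x1 = (width - target_w) // 2
    return (x1, 0, x1 + target_w, height)

print(vertical_crop_box(1920, 1080))  # (656, 0, 1264, 1080)
```

The resulting box then feeds moviepy's crop effect to produce the vertical frame.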
Whisper runs locally on the TikTok segment to generate captions — no API key, no upload, no cost per use.
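Whisper's `transcribe()` returns timed segments of text; turning those into readable on-screen captions is mostly a chunking problem. A sketch that greedily packs each segment's words into short caption lines (the 30-character policy is an assumption of mine, not clipforge's actual logic):

```python
def caption_lines(segments: list[dict], max_chars: int = 30) -> list[str]:
    """Greedily pack each Whisper segment's words into short caption lines."""
    lines = []
    for seg in segments:
        current = ""
        for word in seg["text"].split():
            candidate = f"{current} {word}".strip()
            if len(candidate) > max_chars and current:
                lines.append(current)   # line full: flush and start over
                current = word
            else:
                current = candidate
        if current:
            lines.append(current)
    return lines

segments = [{"start": 0.0, "end": 2.4,
             "text": "that was an absolutely insane clutch play"}]
print(caption_lines(segments))  # ['that was an absolutely insane', 'clutch play']
```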
## The stack
| Layer | Tech |
|---|---|
| Backend | Python, FastAPI, BackgroundTasks |
| Scene detection | PySceneDetect |
| Audio analysis | librosa |
| Video editing | moviepy |
| Captions | OpenAI Whisper (local) |
| Frontend | Next.js 15, Tailwind CSS |
| Deploy | Railway (backend) + Vercel (frontend) |
## What I learned building it
RMS energy is a surprisingly good highlight detector. I expected to need something more sophisticated — computer vision, game event detection, kill feed parsing. But audio alone gets you 80% of the way there. Exciting moments in games are almost always loud moments.
PySceneDetect is fast and battle-tested. I considered writing my own frame differencing logic. I'm glad I didn't. The library handles edge cases (fades, flashes, black frames) that would have taken weeks to debug myself.
Whisper on a 60-second clip is fast enough. I expected local transcription to be a bottleneck. On a modern machine with the base model, a 60-second clip transcribes in under 10 seconds. Good enough for v1.
moviepy's resource management requires care. VideoFileClip objects need explicit .close() calls or you'll leak file handles and temp files across the process lifetime. I wrapped the source in a try/finally block after the first review caught a resource leak on exception paths.
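The same pattern generalizes: anything with a `.close()` can be guarded with `contextlib.closing`, which releases the resource even when an exception fires mid-pipeline. A dependency-free illustration, with `FakeClip` standing in for moviepy's `VideoFileClip`:

```python
from contextlib import closing

class FakeClip:
    """Stand-in for VideoFileClip: records whether close() was called."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

clip = FakeClip()
try:
    with closing(clip):
        raise RuntimeError("encode failed mid-pipeline")
except RuntimeError:
    pass

print(clip.closed)  # True -- closed despite the exception
```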
## Try it
Code: LakshmiSravyaVedantham/clipforge
```bash
git clone https://github.com/LakshmiSravyaVedantham/clipforge
cd clipforge/backend
pip install -r requirements.txt
uvicorn main:app --reload
```
Open the frontend separately:
```bash
cd frontend && npm install && npm run dev
```
The gaming moments that go unshared most often aren't the planned ones. They're the accidental clutches, the absurd bug moments, the 1-in-1000 shots nobody was recording for. clipforge is built for those.