Picking which 60 seconds of a 30-minute YouTube video will go viral is a hard problem. Human editors spend years developing the intuition for it. I spent a few weeks trying to compress that intuition into a GPT-4o prompt — and it actually works well enough for production use.
This article walks through the exact prompt engineering approach, the scoring schema, and the Node.js implementation I use in the pipeline behind ClipSpeedAI.
The Problem With Naive Clip Selection
The obvious approach — "find the loudest moment" or "pick the segment with the most cuts" — doesn't work well. Viral clips share subtler characteristics:
- A clear hook in the first 3 seconds
- An emotional peak (laugh, surprise, revelation)
- Self-contained narrative (viewer doesn't need prior context)
- Speaker is visually prominent and engaged
You can't detect most of these with audio analysis alone. You need semantic understanding of what's being said and shown.
The Approach: Transcript + Frame Analysis
The pipeline works in two passes:
- Transcript pass: Whisper transcribes the full video. GPT-4o reads the transcript in chunks and scores each segment for virality potential.
- Frame pass (optional): For top-scoring segments, extract a keyframe and send it to GPT-4o Vision to verify the visual quality.
Step 1: Generate the Transcript With Timestamps
```javascript
// transcribe.js
import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function transcribeVideo(audioPath) {
  const response = await openai.audio.transcriptions.create({
    file: fs.createReadStream(audioPath),
    model: 'whisper-1',
    response_format: 'verbose_json',
    timestamp_granularities: ['segment']
  });

  // Keep only what the scoring pass needs: start/end in seconds plus text.
  return response.segments.map(seg => ({
    start: seg.start,
    end: seg.end,
    text: seg.text.trim()
  }));
}
```
This gives you an array of transcript segments with precise timestamps — the foundation for everything else.
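For reference, here's a minimal sketch of the shape that array takes, plus the timestamp-prefixed lines the scoring step builds from it (the sample transcript text is made up; `toPromptLines` mirrors the formatting the scorer uses):

```javascript
// Sample of what transcribeVideo() returns: an array of
// { start, end, text } objects, times in seconds.
const segments = [
  { start: 0.0, end: 4.2, text: "Welcome back to the show." },
  { start: 4.2, end: 9.8, text: "Today's guest needs no introduction." }
];

// Build the "[12.3s] ..." lines the scorer prompt expects.
function toPromptLines(segs) {
  return segs.map(s => `[${s.start.toFixed(1)}s] ${s.text}`).join('\n');
}

console.log(toPromptLines(segments));
// [0.0s] Welcome back to the show.
// [4.2s] Today's guest needs no introduction.
```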
Step 2: Chunk and Score With GPT-4o
```javascript
// scorer.js
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const SYSTEM_PROMPT = `You are a social media video editor specializing in YouTube Shorts and TikTok.
You evaluate transcript segments for their potential to go viral as standalone 30-90 second clips.

Score each segment on these dimensions (0-10 each):
- hook_strength: Does it open with something immediately compelling?
- emotional_peak: Is there a laugh, surprise, revelation, or strong reaction?
- self_contained: Can someone understand this clip without prior context?
- speaker_energy: Does the language suggest high energy or animated delivery?
- quotability: Is there a memorable line or punchline?

Return a JSON object with a "segments" array. Each object must have: start, end,
hook_strength, emotional_peak, self_contained, speaker_energy, quotability,
composite_score, reason.

composite_score = weighted average: hook(30%) + emotional_peak(25%) + self_contained(20%) + speaker_energy(15%) + quotability(10%)`;

export async function scoreTranscriptChunks(segments) {
  // Group into ~90-second windows with 15 seconds of overlap.
  const windows = buildWindows(segments, 90, 15);
  const results = [];

  for (const window of windows) {
    const transcriptText = window.segments
      .map(s => `[${s.start.toFixed(1)}s] ${s.text}`)
      .join('\n');

    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      response_format: { type: 'json_object' },
      messages: [
        { role: 'system', content: SYSTEM_PROMPT },
        { role: 'user', content: `Score this transcript window:\n\n${transcriptText}` }
      ],
      temperature: 0.3
    });

    const parsed = JSON.parse(response.choices[0].message.content);
    results.push(...(parsed.segments || []));
  }

  return results.sort((a, b) => b.composite_score - a.composite_score);
}

function buildWindows(segments, windowSecs, overlapSecs) {
  const windows = [];
  let i = 0;

  while (i < segments.length) {
    const windowStart = segments[i].start;
    const windowEnd = windowStart + windowSecs;
    const windowSegs = segments.filter(s => s.start >= windowStart && s.end <= windowEnd);

    if (windowSegs.length > 0) {
      windows.push({ start: windowStart, end: windowEnd, segments: windowSegs });
    }

    // Advance by (window - overlap) so adjacent windows share some context.
    const nextStart = windowStart + windowSecs - overlapSecs;
    i = segments.findIndex(s => s.start >= nextStart);
    if (i === -1) break;
  }

  return windows;
}
```

Note the system prompt asks for a JSON object with a `segments` array rather than a bare array: `json_object` mode requires a top-level object, and it's what the `parsed.segments` access expects.
Step 3: The Prompt Engineering Details
The prompt design choices that made a real difference:
Temperature 0.3, not 0. Pure determinism (temp 0) causes GPT-4o to be conservative — everything scores 6-7, nothing stands out. At 0.3, you get meaningful differentiation.
Explicit weights in the system prompt. Telling the model the composite score formula means it internalizes the weighting rather than averaging naively.
[timestamp] prefixes in the transcript. This anchors the model to actual time positions rather than relative positions, which matters when you need to map scores back to video timestamps.
JSON response_format. Always use response_format: { type: 'json_object' } for structured output: it guarantees syntactically valid JSON, so you never hit markdown code fences or parse failures. Two caveats: the API requires the word "JSON" to appear somewhere in your messages when this mode is on, and the mode only guarantees valid JSON, not your schema, so still spell out the exact fields you want.
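On the weights point, it's also worth enforcing them in code. Models occasionally fudge the arithmetic, so recomputing the composite locally is cheap insurance. A sketch:

```javascript
// Recompute the weighted composite locally rather than trusting the
// model's arithmetic. Weights mirror the system prompt.
const WEIGHTS = {
  hook_strength: 0.30,
  emotional_peak: 0.25,
  self_contained: 0.20,
  speaker_energy: 0.15,
  quotability: 0.10
};

function recomputeComposite(seg) {
  const score = Object.entries(WEIGHTS)
    .reduce((sum, [key, w]) => sum + (seg[key] ?? 0) * w, 0);
  return Math.round(score * 10) / 10; // one decimal place
}
```

Overwrite the model's composite_score with this value before ranking.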
Step 4: Visual Verification Pass
For the top 3 candidates, extract a keyframe and verify visually:
```javascript
// visual_check.js
import { execa } from 'execa';
import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function verifyFrameQuality(videoPath, timestamp) {
  const framePath = `/tmp/frame_${Date.now()}.jpg`;

  // -ss before -i seeks on the input (fast); -y overwrites without
  // prompting if the temp file somehow already exists.
  await execa('ffmpeg', [
    '-y',
    '-ss', String(timestamp),
    '-i', videoPath,
    '-frames:v', '1',
    '-q:v', '2',
    framePath
  ]);

  const imageData = fs.readFileSync(framePath).toString('base64');
  fs.unlinkSync(framePath);

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    response_format: { type: 'json_object' },
    messages: [{
      role: 'user',
      content: [
        {
          type: 'image_url',
          image_url: { url: `data:image/jpeg;base64,${imageData}` }
        },
        {
          type: 'text',
          text: 'Rate this frame 1-10 for: face_visibility, lighting_quality, frame_composition. Return JSON.'
        }
      ]
    }]
  });

  return JSON.parse(response.choices[0].message.content);
}
```
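Which timestamp to pass in is a judgment call; the midpoint of the candidate clip is a reasonable default, and it's worth skipping candidates that overlap an already-chosen clip so you don't verify the same moment three times. A sketch (the helper name and overlap rule are illustrative, not the exact production logic):

```javascript
// Pick the timestamps to sample for the visual pass: take the top N
// scored clips that don't overlap an already-selected clip, and sample
// each one's midpoint.
function keyframeTimestamps(scoredClips, topN = 3) {
  const chosen = [];
  for (const clip of scoredClips) { // assumes sorted by composite_score desc
    const overlaps = chosen.some(c => clip.start < c.end && clip.end > c.start);
    if (!overlaps) chosen.push(clip);
    if (chosen.length === topN) break;
  }
  return chosen.map(c => (c.start + c.end) / 2);
}
```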
Scoring Results in Practice
In real testing against YouTube interview content, GPT-4o consistently identifies:
- Story punchlines and reveal moments
- Moments where the speaker interrupts themselves with excitement
- Strong opinion statements ("The industry is completely wrong about X")
- Counterintuitive claims that create curiosity gaps
Where it struggles: purely visual humor, reaction shots without dialogue, and clips that hinge on niche jargon the model lacks context for.
Integration With the Full Pipeline
The scorer fits into the pipeline like this:
download → transcribe → score → select top N → extract + crop → output
ClipSpeedAI runs this exact flow on uploaded or linked YouTube videos, with some additional heuristics layered on top for things like avoiding mid-sentence cuts and enforcing minimum clip quality thresholds.
What to Try Next
- Experiment with different window sizes. 90 seconds works well for interviews; 45 seconds is better for fast-paced educational content.
- Add a "category" parameter to the prompt so the scoring criteria shifts based on content type (comedy vs. tutorial vs. interview).
- Cache transcript results in Redis so you don't re-transcribe the same video twice.
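For that last point, the main design decision is the cache key. One sketch, keyed on the source URL plus the Whisper model so a model upgrade invalidates stale transcripts (the key scheme is illustrative, not what ClipSpeedAI ships):

```javascript
import { createHash } from 'node:crypto';

// Deterministic cache key for a video's transcript. Hashing keeps keys
// short and uniform regardless of URL length or characters.
function transcriptCacheKey(videoUrl, model = 'whisper-1') {
  const digest = createHash('sha256')
    .update(`${videoUrl}::${model}`)
    .digest('hex');
  return `transcript:${digest.slice(0, 16)}`;
}
```

On a cache hit, skip straight to the scoring pass; on a miss, transcribe and store the segments array as JSON under this key.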
The full scoring system running at ClipSpeedAI has been refined over hundreds of test videos. The version here is the minimal viable implementation — enough to outperform random clip selection significantly.