If you've ever watched a long-form YouTube video and thought "this 45-second segment would kill on Shorts," you already understand the core problem this pipeline solves. Manually trimming, reformatting, and uploading clips is brutal at scale. Let's build a Node.js pipeline that automates the entire process — download, detect, crop, encode, and output a vertical-ready clip.
This is the kind of automation that powers tools like ClipSpeedAI, which does all of this with AI-driven clip selection on top.
The Pipeline Overview
- Download the YouTube video with yt-dlp
- Extract a time segment with FFmpeg
- Detect the "action region" (or use center-crop as a fallback)
- Re-encode to 9:16 vertical at 1080x1920
- Write the output file
Prerequisites
npm install fluent-ffmpeg execa
pip install yt-dlp
Make sure ffmpeg and ffprobe are on your PATH. On Ubuntu:
sudo apt install ffmpeg
Step 1: Download With yt-dlp
// downloader.js
import { execa } from 'execa';
import path from 'path';
export async function downloadVideo(youtubeUrl, outputDir) {
  const outputTemplate = path.join(outputDir, '%(id)s.%(ext)s');
  const { stdout } = await execa('yt-dlp', [
    '--format', 'bestvideo[height<=1080][ext=mp4]+bestaudio[ext=m4a]/best[height<=1080]',
    '--merge-output-format', 'mp4',
    '--output', outputTemplate,
    // A plain `--print filename` implies --simulate, so nothing would be downloaded.
    // Printing at the after_move stage runs the download and reports the final path.
    '--print', 'after_move:filepath',
    youtubeUrl
  ]);
  return stdout.trim();
}
One important note: never route video downloads through proxies. The files are large and proxy bandwidth is expensive. Only proxy the metadata/info-json API calls if you need to avoid rate limits.
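As a rough sketch of that split, the helpers below build two separate yt-dlp argument lists: a metadata-only call that may go through a proxy, and the real download that never does. `--dump-json` and `--proxy` are existing yt-dlp flags; the function names and the proxy URL in the usage note are hypothetical and not part of the pipeline above.

```javascript
// Build yt-dlp arguments for a metadata-only request.
// --dump-json prints the info JSON and simulates, so no media is downloaded.
function metadataArgs(youtubeUrl, proxyUrl = null) {
  const args = ['--dump-json'];
  if (proxyUrl) {
    args.push('--proxy', proxyUrl); // only the metadata requests are proxied
  }
  args.push(youtubeUrl);
  return args;
}

// Build yt-dlp arguments for the real download, deliberately without --proxy.
function downloadArgs(youtubeUrl, outputTemplate) {
  return ['--output', outputTemplate, youtubeUrl];
}
```

You would pass either list to `execa('yt-dlp', ...)`, for example `metadataArgs(url, 'http://proxy.example:8080')`, and `JSON.parse` the stdout of the metadata call.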
Step 2: Extract a Segment
// clipper.js
import ffmpeg from 'fluent-ffmpeg';
export function extractSegment(inputPath, outputPath, startTime, duration) {
  return new Promise((resolve, reject) => {
    ffmpeg(inputPath)
      .seekInput(startTime)
      .duration(duration)
      .outputOptions(['-c:v libx264', '-c:a aac', '-avoid_negative_ts make_zero'])
      .output(outputPath)
      .on('end', resolve)
      .on('error', reject)
      .run();
  });
}
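.seekInput() places the seek before the input (input seeking), which is much faster than output seeking: FFmpeg jumps to the nearest keyframe before the target timestamp instead of decoding and discarding every frame from the start of the file. Because the segment is re-encoded, the cut still lands on the requested timestamp.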
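To make the difference concrete, here is a small sketch of the two raw FFmpeg argument orders. The helper names are hypothetical, and the lists are a simplified version of the command line, not the exact output of fluent-ffmpeg.

```javascript
// Input seeking: -ss comes before -i, so FFmpeg seeks in the file first.
function inputSeekArgs(input, output, start, duration) {
  return ['-ss', String(start), '-i', input, '-t', String(duration), output];
}

// Output seeking: -ss comes after -i, so FFmpeg decodes from the
// beginning and throws frames away until it reaches the timestamp.
function outputSeekArgs(input, output, start, duration) {
  return ['-i', input, '-ss', String(start), '-t', String(duration), output];
}
```

Both variants produce a clip of the same length; only the position of `-ss` relative to `-i` changes how much of the source has to be decoded.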
Step 3: Crop to 9:16
Here's where it gets interesting. For a 1920x1080 source, a full-height 9:16 crop is 607x1080 (1080 × 9/16 = 607.5, rounded down). But Shorts expect 1080x1920, so after cropping we scale the result up.
// cropper.js
import ffmpeg from 'fluent-ffmpeg';
export function cropToVertical(inputPath, outputPath, cropX = null) {
  // Assumes a 1920x1080 source: crop 607px wide, centered or at cropX.
  // For other inputs, read the real dimensions first (e.g. with ffprobe).
  const sourceWidth = 1920;
  const sourceHeight = 1080;
  const cropWidth = Math.floor(sourceHeight * (9 / 16)); // 607
  // Clamp cropX so the crop window never runs past the frame edge
  const maxX = sourceWidth - cropWidth;
  const x = cropX !== null ? Math.min(Math.max(cropX, 0), maxX) : Math.floor(maxX / 2);
  return new Promise((resolve, reject) => {
    ffmpeg(inputPath)
      .videoFilter([
        `crop=${cropWidth}:${sourceHeight}:${x}:0`,
        `scale=1080:1920:flags=lanczos`
      ])
      .outputOptions([
        '-c:v libx264',
        '-preset fast',
        '-crf 23',
        '-c:a aac',
        '-b:a 128k',
        '-movflags +faststart'
      ])
      .output(outputPath)
      .on('end', resolve)
      .on('error', reject)
      .run();
  });
}
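The function above hard-codes a 1920x1080 source. As a minimal sketch of the same arithmetic for any landscape input, the hypothetical helper below computes the crop window from the source dimensions and an optional horizontal center; its name and return shape are assumptions for illustration, not part of cropper.js.

```javascript
// Compute a full-height 9:16 crop window for a landscape source.
// centerX (optional) is the desired horizontal center of the crop, in source pixels.
function computeVerticalCrop(sourceWidth, sourceHeight, centerX = null) {
  const cropWidth = Math.floor(sourceHeight * (9 / 16));
  if (cropWidth > sourceWidth) {
    throw new Error('Source is too narrow for a full-height 9:16 crop');
  }
  const maxX = sourceWidth - cropWidth;
  const wantedX = centerX === null
    ? Math.floor(maxX / 2)                  // center-crop fallback
    : Math.round(centerX - cropWidth / 2);  // center the window on centerX
  const x = Math.min(Math.max(wantedX, 0), maxX); // keep the window inside the frame
  return { cropWidth, cropHeight: sourceHeight, x, y: 0 };
}

console.log(computeVerticalCrop(1920, 1080));
// { cropWidth: 607, cropHeight: 1080, x: 656, y: 0 }
```

The returned values map directly onto FFmpeg's `crop=w:h:x:y` filter, and the clamp keeps an off-center target from pushing the window outside the frame.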
The -movflags +faststart flag moves the moov atom to the front of the file, which is essential for streaming and preview loading in browser players.
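If you want to confirm the flag took effect, one option is to look at the order of the top-level MP4 boxes. The sketch below is a simplified check under a few assumptions: the whole file fits in memory as a Buffer, and the helper names are hypothetical.

```javascript
// List the top-level MP4 boxes in a buffer, in file order.
// Each box starts with a 4-byte big-endian size and a 4-byte ASCII type.
function topLevelBoxes(buffer) {
  const boxes = [];
  let offset = 0;
  while (offset + 8 <= buffer.length) {
    let size = buffer.readUInt32BE(offset);
    const type = buffer.toString('latin1', offset + 4, offset + 8);
    if (size === 1) {
      // A size of 1 means the real size is a 64-bit value after the type.
      if (offset + 16 > buffer.length) break;
      size = Number(buffer.readBigUInt64BE(offset + 8));
    } else if (size === 0) {
      // A size of 0 means the box runs to the end of the file.
      size = buffer.length - offset;
    }
    boxes.push(type);
    if (size < 8) break; // malformed box; stop rather than loop forever
    offset += size;
  }
  return boxes;
}

// True when moov appears before mdat, which is the layout +faststart produces.
function isFastStart(buffer) {
  const boxes = topLevelBoxes(buffer);
  const moov = boxes.indexOf('moov');
  const mdat = boxes.indexOf('mdat');
  return moov !== -1 && mdat !== -1 && moov < mdat;
}
```

For a finished clip you could call `isFastStart(fs.readFileSync(outputPath))`; for very large files you would want to read only the box headers instead of the whole file.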
Step 4: Wire It Together
// pipeline.js
import { downloadVideo } from './downloader.js';
import { extractSegment } from './clipper.js';
import { cropToVertical } from './cropper.js';
import path from 'path';
import fs from 'fs';
const TMP = '/tmp/clips';
fs.mkdirSync(TMP, { recursive: true });
async function processYouTubeToShort(youtubeUrl, startTime, duration, cropX = null) {
  console.log('Downloading...');
  const sourcePath = await downloadVideo(youtubeUrl, TMP);
  console.log('Extracting segment...');
  const segmentPath = path.join(TMP, `segment_${Date.now()}.mp4`);
  await extractSegment(sourcePath, segmentPath, startTime, duration);
  console.log('Cropping to vertical...');
  const outputPath = path.join(TMP, `short_${Date.now()}.mp4`);
  try {
    await cropToVertical(segmentPath, outputPath, cropX);
  } finally {
    // Remove the intermediate segment even if the crop step fails
    fs.unlinkSync(segmentPath);
  }
  console.log(`Done: ${outputPath}`);
  return outputPath;
}
// Example usage
processYouTubeToShort(
  'https://www.youtube.com/watch?v=dQw4w9WgXcQ',
  '00:01:24',
  45,
  700 // crop starting at x=700
).catch((err) => {
  // Surface download or FFmpeg failures instead of an unhandled rejection
  console.error('Pipeline failed:', err);
  process.exitCode = 1;
});
Adding Smart Crop Detection
For a basic center-crop fallback, what we have is fine. For intelligent crop detection — like following a speaker's face — you need a secondary analysis pass. That's where integrating something like MediaPipe or a frame-by-frame face detection step comes in.
The general pattern is: run face detection on a keyframe every N seconds, collect the bounding box centroids, then compute the median X position across the clip. This gives you a stable crop X that doesn't jitter.
async function getStableCropX(videoPath, fps = 1) {
  // Extract keyframes at 1fps to /tmp/frames/
  // Run face detection on each frame
  // Return median face center X, scaled to source resolution
  // (implementation depends on your detection model)
}
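The detection step depends on your model, but the median step does not. Here is a minimal sketch of it, assuming you already have an array of face center X positions in source pixels, one per sampled frame; the function names are hypothetical.

```javascript
// Median of a non-empty list of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Turn per-frame face centers (source pixels) into one stable crop offset.
// Returns null when no faces were found, so the caller can fall back to center-crop.
function stableCropXFromCenters(faceCentersX, sourceWidth, cropWidth) {
  if (faceCentersX.length === 0) return null;
  const centerX = median(faceCentersX);
  const maxX = sourceWidth - cropWidth;
  const x = Math.round(centerX - cropWidth / 2); // left edge that centers the face
  return Math.min(Math.max(x, 0), maxX);         // keep the window inside the frame
}

// One outlier detection (x = 5) barely moves the result.
console.log(stableCropXFromCenters([900, 1000, 5, 1010, 990], 1920, 607)); // 687
```

Using the median rather than the mean means a single false detection does not drag the crop window across the frame, and the `null` return lines up with the `cropX = null` default in cropToVertical.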
Tools like ClipSpeedAI handle this detection and crop targeting automatically, which is the production-grade version of what we've built here.
Performance Notes
- Input seeking (seekInput before input) is 3-10x faster than output seeking for long videos
- preset fast vs preset slow: about 2x speed difference with minimal quality delta at CRF 23
- For batch jobs, use a queue (Bull + Redis) rather than running these concurrently: FFmpeg is CPU-bound and concurrent jobs will thrash each other
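To illustrate the queueing idea without a Redis dependency, here is a minimal in-process stand-in. It is not Bull and has no persistence or retries; `createQueue` is a hypothetical helper that only caps how many jobs run at once.

```javascript
// Minimal in-process queue: runs at most `concurrency` tasks at a time.
function createQueue(concurrency = 1) {
  const pending = [];
  let active = 0;

  function next() {
    if (active >= concurrency || pending.length === 0) return;
    const { task, resolve, reject } = pending.shift();
    active += 1;
    Promise.resolve()
      .then(task)             // also catches synchronous throws from task
      .then(resolve, reject)  // forward the result to the caller of add()
      .finally(() => {
        active -= 1;
        next();               // start the next waiting job, if any
      });
  }

  return {
    // add() takes a function returning a promise and resolves with its result.
    add(task) {
      return new Promise((resolve, reject) => {
        pending.push({ task, resolve, reject });
        next();
      });
    },
  };
}
```

With `const queue = createQueue(1)`, each call to `queue.add(() => processYouTubeToShort(url, start, duration))` waits for the previous encode to finish, so FFmpeg jobs do not compete for the same cores.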
What's Next
This pipeline is the foundation. From here you can layer in:
- GPT-4o-based clip scoring to find the best segments automatically
- Whisper-based transcription for burned-in captions
- A job queue for processing dozens of videos with controlled concurrency
If you want to skip building all of this yourself, ClipSpeedAI wraps the entire pipeline into a hosted API — worth checking out if you're building on top of YouTube content at scale.
The full code above is production-ready for single-file processing. Wire it into a Bull queue and you've got a scalable YouTube Shorts factory.