FFmpeg's default settings are designed for output quality, not processing speed. On a production server where you're processing dozens of videos per hour, the difference between the right settings and the defaults can be the difference between a 60-second encode and a 100-second encode — at scale, that's the difference between serving 60 jobs/hour or 36 jobs/hour on the same hardware.
This post covers the exact FFmpeg settings that cut processing time by 40% in the pipeline at ClipSpeedAI, with benchmarks and the reasoning behind each change.
The Baseline
Default FFmpeg encode for a 60-second 1080p clip to 1080x1920 vertical:
ffmpeg -i input.mp4 -vf "crop=607:1080:660:0,scale=1080:1920" \
-c:v libx264 -c:a aac output.mp4
Time: ~85 seconds on a 2-core server.
Target: under 55 seconds with acceptable quality loss.
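The crop numbers in the baseline command follow from the 9:16 target shape: a 1080px-tall frame needs a window 1080 × 9/16 = 607.5px wide, floored to 607. Here is a minimal sketch of that arithmetic. `centeredVerticalCrop` is a hypothetical helper name, and it centers the window, which gives an x offset of 656 instead of the 660 used in the command above.

```javascript
// Hypothetical helper: compute a centered 9:16 crop for a landscape source.
function centeredVerticalCrop(sourceWidth, sourceHeight) {
  // Width of a 9:16 window that uses the full source height.
  const cropWidth = Math.floor(sourceHeight * (9 / 16));
  // Center the window horizontally.
  const cropX = Math.floor((sourceWidth - cropWidth) / 2);
  return { cropWidth, cropX, filter: `crop=${cropWidth}:${sourceHeight}:${cropX}:0` };
}

console.log(centeredVerticalCrop(1920, 1080).filter); // crop=607:1080:656:0
```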
Optimization 1: Encoder Preset
The most impactful single setting. libx264's preset controls the speed/compression tradeoff:
# Default (no preset specified) = medium
# Slow: best compression, slowest encode
# Medium: balanced default
# Fast: ~30% faster, ~5% larger file
# Faster: ~50% faster, ~10% larger file
# Veryfast: ~65% faster, ~15% larger file
For short-form video clips (30-90 seconds), file size isn't a significant concern. A 5-10MB clip vs a 7-12MB clip doesn't matter to the end user.
# -preset fast replaces the default (medium)
ffmpeg -i input.mp4 \
-vf "crop=607:1080:660:0,scale=1080:1920" \
-c:v libx264 \
-preset fast \
-crf 23 \
-c:a aac output.mp4
Improvement: ~28% faster (85s → 61s)
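To reproduce this kind of comparison on your own hardware, one option is to time the same encode under several presets. The sketch below assumes `ffmpeg` is on the PATH. `presetArgs` and `timePreset` are hypothetical helper names, and the timings you get will depend on your source material and CPU.

```javascript
// Preset names accepted by libx264, fastest to slowest.
const X264_PRESETS = ['ultrafast', 'superfast', 'veryfast', 'faster', 'fast',
  'medium', 'slow', 'slower', 'veryslow', 'placebo'];

// Build the argument list for one encode with the given preset.
function presetArgs(inputPath, outputPath, preset) {
  if (!X264_PRESETS.includes(preset)) {
    throw new Error(`Unknown libx264 preset: ${preset}`);
  }
  return [
    '-y', '-i', inputPath,
    '-vf', 'crop=607:1080:660:0,scale=1080:1920',
    '-c:v', 'libx264', '-preset', preset, '-crf', '23',
    '-c:a', 'aac',
    outputPath,
  ];
}

// Run one encode and return the wall-clock time in milliseconds.
async function timePreset(inputPath, outputPath, preset) {
  const { spawnSync } = await import('node:child_process');
  const start = Date.now();
  const result = spawnSync('ffmpeg', presetArgs(inputPath, outputPath, preset), { stdio: 'ignore' });
  if (result.status !== 0) throw new Error(`ffmpeg failed for preset ${preset}`);
  return Date.now() - start;
}
```

Calling `timePreset` in a loop over `['medium', 'fast', 'veryfast']` with the same input gives a rough per-preset comparison. Run each preset more than once before trusting the numbers.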
Optimization 2: Input Seeking
Placing -ss before -i vs after -i makes a massive difference for segment extraction:
# SLOW: output seeking (decodes everything from start)
ffmpeg -i input.mp4 -ss 45 -t 60 output.mp4
# FAST: input seeking (jumps to timestamp before decoding)
ffmpeg -ss 45 -i input.mp4 -t 60 output.mp4
Output seeking decodes every frame from 0 to the seek point. For a 30-minute video with a clip starting at minute 25, that's 25 minutes of decoding you're throwing away.
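Input seeking first jumps to the nearest keyframe before the requested timestamp. When you re-encode, FFmpeg then decodes and discards the frames between that keyframe and the timestamp (accurate seeking, which is on by default), so the cut is still frame-accurate. Only with stream copy (-c copy) does the output start at the keyframe instead of the exact timestamp.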
Improvement: Up to 10x faster for long source videos
// Node.js: always put -ss before -i
await execa('ffmpeg', [
  '-ss', String(startTime), // BEFORE -i
  '-i', videoPath,
  '-t', String(duration),
  '-c:v', 'libx264',
  '-preset', 'fast',
  '-crf', '23',
  output
]);
Optimization 3: Audio Copy vs Re-encode
If the source audio is already AAC (which it is for most YouTube-downloaded MP4s), copy it instead of re-encoding:
# Re-encoding audio (unnecessary for AAC→AAC)
-c:a aac -b:a 128k
# Copy audio stream directly
-c:a copy
Caveat: Only use -c:a copy when the output container supports the source audio codec. AAC audio in an MP4 container is fine. If you're burning captions with the ASS filter, you can still copy audio (the filter only touches the video stream).
Improvement: ~5-8% faster, plus slightly better audio quality (no re-encode loss).
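Since copying is only safe when the source really is AAC, it can help to check the codec before choosing the audio arguments. The sketch below assumes `ffprobe` is on the PATH. `audioArgsFor` and `probeAudioCodec` are hypothetical helper names, and re-encoding everything that is not AAC is a simplification (MP4 can carry other audio codecs too).

```javascript
// Decide audio arguments from the source codec name (as reported by ffprobe).
function audioArgsFor(codecName) {
  // AAC can be stream-copied into MP4; anything else is re-encoded to AAC here.
  return codecName === 'aac' ? ['-c:a', 'copy'] : ['-c:a', 'aac', '-b:a', '128k'];
}

// Ask ffprobe for the codec of the first audio stream ('' if there is none).
async function probeAudioCodec(inputPath) {
  const { execFileSync } = await import('node:child_process');
  const out = execFileSync('ffprobe', [
    '-v', 'error',
    '-select_streams', 'a:0',
    '-show_entries', 'stream=codec_name',
    '-of', 'default=noprint_wrappers=1:nokey=1',
    inputPath,
  ], { encoding: 'utf8' });
  return out.trim();
}
```

A caller would spread `audioArgsFor(await probeAudioCodec(inputPath))` into the FFmpeg argument list in place of a hard-coded `-c:a` choice.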
Optimization 4: Scale Filter with Faster Algorithm
The scale filter's default algorithm (bicubic) sits between the alternatives: lanczos is sharper but slower, which matters most when upscaling, while bilinear is faster with minimal quality loss when downscaling:
# Default (bicubic)
scale=1080:1920
# For upscaling (lanczos is sharper but slower than bicubic)
scale=1080:1920:flags=lanczos
# For downscaling (bilinear is fast with minimal quality loss)
scale=1080:1920:flags=bilinear
For vertical video reframing from 1080p sources, you're upscaling a 607px-wide crop to 1080px — use lanczos for quality. For thumbnail generation where you're downscaling, use bilinear for speed.
Improvement: 3-7% depending on content complexity
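The upscale-versus-downscale rule above can be turned into a small selector. This is a sketch only: `scaleFilter` is a hypothetical helper, and the 320x180 thumbnail size is an assumed example value.

```javascript
// Hypothetical helper: choose a scaling algorithm from the resize direction.
function scaleFilter(srcWidth, srcHeight, dstWidth, dstHeight) {
  // Compare pixel counts to decide whether the output is larger than the input.
  const upscaling = dstWidth * dstHeight > srcWidth * srcHeight;
  const flags = upscaling ? 'lanczos' : 'bilinear';
  return `scale=${dstWidth}:${dstHeight}:flags=${flags}`;
}

console.log(scaleFilter(607, 1080, 1080, 1920)); // scale=1080:1920:flags=lanczos
console.log(scaleFilter(1920, 1080, 320, 180));  // scale=320:180:flags=bilinear
```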
Optimization 5: Disable Unused FFmpeg Features
By default FFmpeg processes every stream it selects, including ones you don't need. Disable them explicitly:
# If you only want video (no audio processing):
-an
# If you only want audio:
-vn
# For image/thumbnail extraction:
-frames:v 1 # stop after 1 frame
For the segment extraction step (before caption burning), extract without re-encoding video at all — just cut:
# Ultra-fast segment extraction (no re-encode)
ffmpeg -ss 45 -i input.mp4 -t 60 -c copy segment.mp4
-c copy passes through both video and audio streams without re-encoding. This takes ~2 seconds regardless of segment length, vs 30-60 seconds for a full re-encode. The trade-off: you can only cut at keyframe boundaries, so the start point may be off by up to the keyframe interval (~2 seconds for most YouTube content).
For two-step pipelines (extract, then crop and encode), this is the right approach: a fast extract with -c copy, then a clean encode on the shorter segment.
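The extract-then-encode flow can be sketched as two argument builders plus a small runner. The helper names and the temp-file naming are hypothetical, and the crop values assume a 1920x1080 source with AAC audio. `ffmpeg` is assumed to be on the PATH.

```javascript
// Step 1: cut the segment without re-encoding (fast, keyframe-aligned).
function extractArgs(inputPath, segmentPath, startTime, duration) {
  return ['-y', '-ss', String(startTime), '-i', inputPath,
    '-t', String(duration), '-c', 'copy', segmentPath];
}

// Step 2: crop, scale, and encode the short segment.
function encodeArgs(segmentPath, outputPath, cropX) {
  return ['-y', '-i', segmentPath,
    '-vf', `crop=607:1080:${cropX}:0,scale=1080:1920:flags=lanczos`,
    '-c:v', 'libx264', '-preset', 'fast', '-crf', '23',
    '-c:a', 'copy', '-movflags', '+faststart', outputPath];
}

// Run both steps in order and return the final output path.
async function extractThenEncode(inputPath, outputPath, startTime, duration, cropX) {
  const { execFileSync } = await import('node:child_process');
  const segmentPath = `${outputPath}.segment.mp4`; // hypothetical temp-file naming
  execFileSync('ffmpeg', extractArgs(inputPath, segmentPath, startTime, duration), { stdio: 'ignore' });
  execFileSync('ffmpeg', encodeArgs(segmentPath, outputPath, cropX), { stdio: 'ignore' });
  return outputPath;
}
```

The intermediate segment is left on disk in this sketch, so a real pipeline would delete it after the second step.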
Optimization 6: faststart Flag
Not a speed optimization for encoding, but critical for streaming performance:
-movflags +faststart
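This moves the moov atom (the index a player needs before it can decode anything) to the beginning of the MP4 file. Without it, a browser typically has to fetch the end of the file, or the whole file, before it can start playing. With it, playback can begin as soon as the first bytes arrive.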
Required for any clip that users will preview or play in a web browser before downloading.
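One way to confirm the flag took effect is to walk the file's top-level boxes and check that moov comes before mdat. The sketch below works on a Buffer holding the file contents. `topLevelBoxes` and `hasFaststart` are hypothetical helper names, not part of FFmpeg or the original pipeline.

```javascript
// List top-level MP4 box types in file order.
function topLevelBoxes(buf) {
  const boxes = [];
  let offset = 0;
  while (offset + 8 <= buf.length) {
    let size = buf.readUInt32BE(offset);
    const type = buf.toString('latin1', offset + 4, offset + 8);
    if (size === 1) {
      // A size of 1 means a 64-bit size follows the type field.
      if (offset + 16 > buf.length) break;
      size = Number(buf.readBigUInt64BE(offset + 8));
    } else if (size === 0) {
      // A size of 0 means the box extends to the end of the file.
      size = buf.length - offset;
    }
    boxes.push(type);
    if (size < 8) break; // malformed box; stop rather than loop forever
    offset += size;
  }
  return boxes;
}

// True when the moov box appears before the mdat box.
function hasFaststart(buf) {
  const boxes = topLevelBoxes(buf);
  const moov = boxes.indexOf('moov');
  const mdat = boxes.indexOf('mdat');
  return moov !== -1 && (mdat === -1 || moov < mdat);
}
```

For short clips, passing the result of `fs.readFileSync(outputPath)` is enough. For large files it would be cheaper to read only the box headers.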
The Optimized Full Command
Combining all optimizations:
// lib/ffmpeg/encode_clip.js
import { execa } from 'execa';
export async function encodeVerticalClip({
  inputPath,
  outputPath,
  startTime,
  duration,
  cropX,
  sourceWidth = 1920,
  sourceHeight = 1080,
  assSubtitlePath = null
}) {
  const cropWidth = Math.floor(sourceHeight * (9 / 16));
  const videoFilters = [
    `crop=${cropWidth}:${sourceHeight}:${cropX}:0`,
    'scale=1080:1920:flags=lanczos'
  ];
  if (assSubtitlePath) {
    videoFilters.push(`ass=${assSubtitlePath}`);
  }
  const args = [
    '-ss', String(startTime), // input seeking
    '-i', inputPath,
    '-t', String(duration),
    '-vf', videoFilters.join(','),
    '-c:v', 'libx264',
    '-preset', 'fast',
    '-crf', '23',
    '-c:a', 'copy', // copy audio, don't re-encode
    '-movflags', '+faststart',
    '-avoid_negative_ts', 'make_zero',
    '-y',
    outputPath
  ];
  const start = Date.now();
  await execa('ffmpeg', args);
  const elapsed = Date.now() - start;
  console.log(`Encoded ${duration}s clip in ${elapsed}ms`);
  return outputPath;
}
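Note that `sourceWidth` is accepted by the function above but never used. One plausible use for it, sketched here as a hypothetical `clampCropX` helper that is not part of the original pipeline, is to keep the crop window inside the frame explicitly instead of relying on how FFmpeg handles out-of-range offsets.

```javascript
// Hypothetical helper: keep a crop window of cropWidth pixels inside the frame.
function clampCropX(cropX, sourceWidth, cropWidth) {
  // Largest x offset that still leaves room for the full window.
  const maxX = Math.max(0, sourceWidth - cropWidth);
  return Math.min(Math.max(0, Math.round(cropX)), maxX);
}

console.log(clampCropX(660, 1920, 607));  // 660 (already inside the frame)
console.log(clampCropX(1500, 1920, 607)); // 1313 (pulled back to the right edge)
console.log(clampCropX(-20, 1920, 607));  // 0
```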
Benchmark Summary
| Setting | Change | Time Saved |
|---|---|---|
| preset: fast | was: medium | ~28% |
| Input seeking | was: output seeking | Up to 10x for long videos |
| -c:a copy | was: re-encode AAC | ~6% |
| scale:flags=lanczos | was: default | ~4% |
| Combined | | ~40% total |
Before optimizations: 85s average for a 60-second clip on a 2-core server.
After optimizations: 51s average.
At 10 clips per job and 20 jobs/hour, that's 200 clips/hour at 34 seconds saved each, or 6,800 seconds/hour of compute saved. That is enough to meaningfully reduce server costs or increase throughput at the same cost.
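The saving can be checked from the averages reported above (85s before, 51s after) and the stated load of 10 clips per job at 20 jobs/hour. The function names below are hypothetical and only illustrate the arithmetic.

```javascript
// Compute-seconds saved per hour when each clip encodes faster.
function secondsSavedPerHour({ beforeSec, afterSec, clipsPerJob, jobsPerHour }) {
  const clipsPerHour = clipsPerJob * jobsPerHour;
  return (beforeSec - afterSec) * clipsPerHour;
}

// Fraction of encode time removed, e.g. 0.4 for a 40% reduction.
function reduction(beforeSec, afterSec) {
  return (beforeSec - afterSec) / beforeSec;
}

const saved = secondsSavedPerHour({ beforeSec: 85, afterSec: 51, clipsPerJob: 10, jobsPerHour: 20 });
console.log(`${saved} compute-seconds saved per hour`);
console.log(`${(reduction(85, 51) * 100).toFixed(0)}% less encode time per clip`);
```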
ClipSpeedAI processes every clip with these settings. The full encode pipeline described here processes a 60-second clip in approximately 45-55 seconds including face detection, caption burning, and upload — a number that only became achievable with these FFmpeg optimizations in place.
For the upstream pipeline that feeds clips into this encoder — YouTube download, Whisper transcription, and GPT-4o scoring — check out the other articles in this series. If you'd rather use the fully-optimized hosted version than implement it yourself, ClipSpeedAI puts all of these optimizations to work for you automatically.