Subtitles are table stakes for modern video content — 85% of social video is watched without sound. But if you're a developer running a video pipeline, you need to think beyond "just upload to YouTube" and start thinking about programmatic subtitle generation, SRT formatting, and clean ffmpeg burn-in workflows.
This post walks through the technical stack behind free subtitle generation: what Whisper actually does under the hood, how SRT files are structured, and how to burn accurate captions into video clips with ffmpeg. We'll also look at where ClipSpeedAI fits when you're building for creators at scale.
Why Manual Tools Break at Volume
Tools like Amara and Veed.io are fine for one-off videos. But once you're generating 20-50 clips per day from long-form content — podcasts, livestreams, interviews — manual subtitling becomes a bottleneck. The solution is a pipeline:
Audio extraction → ASR transcription → Timestamp alignment → SRT generation → ffmpeg burn-in
Each stage can be automated. Let's break them down.
Stage 1: Audio Extraction with ffmpeg
Before transcribing, you need clean mono audio at the right sample rate:
ffmpeg -i input_video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 output_audio.wav
The -ar 16000 flag matters — Whisper was trained on 16kHz audio. Passing higher sample rates works but adds unnecessary compute overhead on the transcription side.
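If you are scripting the pipeline end to end, the extraction command is easy to generate from Python. A minimal sketch (the helper names `build_extract_cmd` and `extract_audio` are illustrative, not from any library):

```python
import subprocess

def build_extract_cmd(src, dst, sample_rate=16000):
    """Build the ffmpeg argv for mono 16 kHz PCM extraction."""
    return [
        "ffmpeg", "-y",           # -y: overwrite the output if it exists
        "-i", src,
        "-vn",                    # drop the video stream
        "-acodec", "pcm_s16le",   # 16-bit PCM WAV
        "-ar", str(sample_rate),  # 16 kHz, matching Whisper's training data
        "-ac", "1",               # mono
        dst,
    ]

def extract_audio(src, dst):
    """Run the extraction, raising CalledProcessError on failure."""
    subprocess.run(build_extract_cmd(src, dst), check=True)
```

Building the argv as a list (rather than a shell string) also sidesteps quoting issues when filenames contain spaces.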
Stage 2: Transcription with OpenAI Whisper
Whisper is an encoder-decoder transformer trained on 680,000 hours of multilingual audio. Its strength over traditional ASR is robustness to accented speech, background noise, and domain-specific vocabulary.
import whisper

model = whisper.load_model("base")  # or small/medium/large
result = model.transcribe("output_audio.wav", word_timestamps=True)

for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
word_timestamps=True gives per-word timing — critical for SRT files where each caption appears and disappears at exactly the right moment. The base model runs comfortably faster than realtime even on CPU, so a 60-second clip typically transcribes in a few seconds.
Stage 3: Building the SRT File
SRT is simple: sequence number, timestamp range, text.
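For reference, here is what two consecutive entries look like on disk. Timestamps use a comma before the milliseconds, and a blank line separates entries:

```
1
00:00:00,000 --> 00:00:02,400
Welcome back to the show.

2
00:00:02,400 --> 00:00:05,120
Today we're talking about captions.
```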
def segments_to_srt(segments, max_chars_per_line=42):
    srt_lines = []
    for i, seg in enumerate(segments, 1):
        start = fmt_ts(seg['start'])
        end = fmt_ts(seg['end'])
        text = seg['text'].strip()
        if len(text) > max_chars_per_line:
            # Break at the last word boundary before the midpoint;
            # fall back to the first space if the front half has none
            mid = text.rfind(' ', 0, len(text) // 2)
            if mid == -1:
                mid = text.find(' ')
            if mid != -1:
                text = text[:mid] + '\n' + text[mid+1:]
        srt_lines.append(f"{i}\n{start} --> {end}\n{text}\n")
    return "\n".join(srt_lines)

def fmt_ts(s):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    h, m = int(s // 3600), int((s % 3600) // 60)
    sec, ms = int(s % 60), int((s % 1) * 1000)
    return f"{h:02d}:{m:02d}:{sec:02d},{ms:03d}"
Stage 4: Burning Subtitles into Video
For social media clips, hard subtitles (baked into pixels) are required since most platforms strip soft subtitle tracks:
ffmpeg -i input_video.mp4 \
-vf "subtitles=captions.srt:force_style='FontName=Arial,FontSize=22,PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,Outline=2,Bold=1,Alignment=2'" \
-c:a copy \
output_with_captions.mp4
For vertical (9:16) clips, push captions up so they don't get buried by UI chrome:
-vf "subtitles=captions.srt:force_style='Alignment=2,MarginV=80'"
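If you build these filter strings in code rather than hand-writing shell commands, a small helper keeps the style options readable (the function names here are illustrative, not part of any library):

```python
def force_style(**opts):
    """Join libass style options into a force_style string."""
    return ",".join(f"{k}={v}" for k, v in opts.items())

def subtitles_filter(srt_path, **style):
    """Build the argument for ffmpeg's -vf subtitles filter."""
    return f"subtitles={srt_path}:force_style='{force_style(**style)}'"

# For a vertical clip:
vf = subtitles_filter("captions.srt", Alignment=2, MarginV=80)
```

Keyword arguments preserve insertion order in Python 3.7+, so the emitted style string is deterministic.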
Production Architecture for Scale
At volume, you need a queue:
Job queue (Redis/BullMQ) → Worker pool →
Whisper transcription → SRT assembly → ffmpeg render →
Object storage (S3/R2)
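In Python terms, a worker amounts to a loop that pulls jobs and runs the stages in order. The sketch below stubs every stage and uses an in-memory deque where a real deployment would use Redis or similar (all names are illustrative):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class ClipJob:
    video_path: str
    status: str = "queued"
    artifacts: dict = field(default_factory=dict)

def process(job):
    """Run one job through the pipeline stages in order."""
    # Stage stubs: real versions call ffmpeg, Whisper, and the SRT
    # builder shown earlier, then upload the render to object storage.
    job.artifacts["audio"] = job.video_path + ".wav"
    job.artifacts["srt"] = job.video_path + ".srt"
    job.artifacts["render"] = job.video_path + ".captioned.mp4"
    job.status = "done"
    return job

def worker(queue):
    """Drain the queue, returning finished jobs."""
    return [process(queue.popleft()) for _ in range(len(queue))]
```

Because each stage is CPU- or GPU-bound for seconds at a time, one worker process per core (or per GPU) is a reasonable starting point before tuning.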
With base Whisper on a 4-core CPU machine, you can process ~40-50 minutes of video per hour. The GPU path with medium Whisper on CUDA gets 10-20x that throughput.
Whisper Failure Modes to Know
- Hallucination: Near-silent passages can trigger fabricated text. Detect it by comparing per-segment RMS audio energy against transcription density.
- Speaker overlap: Whisper merges overlapping speech. Fix: pyannote.audio diarization before passing to Whisper.
- Brand/proper nouns: Use initial_prompt to prime the model with context vocabulary.
- Low-confidence filtering: drop or flag segments with a low avg_logprob:

for seg in result["segments"]:
    if seg.get("avg_logprob", 0) < -1.0:
        continue  # flag for review
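The hallucination check from the list above can be sketched with nothing but the standard library: compute RMS energy for each segment's audio span and flag segments that carry text over near-silence (the threshold and helper names are illustrative):

```python
import math

def segment_rms(samples, sr, start, end):
    """RMS energy of float samples between start and end (seconds)."""
    chunk = samples[int(start * sr):int(end * sr)]
    if not chunk:
        return 0.0
    return math.sqrt(sum(x * x for x in chunk) / len(chunk))

def looks_hallucinated(samples, sr, seg, rms_floor=0.01):
    """Flag segments where Whisper produced text over near-silent audio."""
    has_text = bool(seg["text"].strip())
    return has_text and segment_rms(samples, sr, seg["start"], seg["end"]) < rms_floor
```

Tune rms_floor against your own content; noisy interview audio needs a lower floor than studio podcasts.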
ClipSpeedAI's Production Stack
ClipSpeedAI handles the full subtitle pipeline — Whisper transcription, animated caption rendering, and per-frame speaker tracking so captions stay readable regardless of camera cuts. Check out the ClipSpeedAI feature set for details on the animated caption renderer and vertical-format optimization.
The original breakdown is on the ClipSpeedAI blog.
Summary
Free subtitle generation is fully achievable with Whisper + Python + ffmpeg. The engineering challenges are around accuracy edge cases, speaker overlap, and rendering quality for vertical clips. For teams that want to skip building this infrastructure themselves, ClipSpeedAI runs this pipeline at scale.
Try ClipSpeedAI free — no card required.