I built a video transcription feature into my side project — a free video downloader called Videolyti. Here's how I wired up OpenAI's Whisper model to transcribe downloaded videos on the server side, what worked, what didn't, and what I'd do differently.
Why Server-Side Transcription?
Most transcription tools either charge per minute of audio or require you to upload files to some third-party API. I wanted something that runs on my own hardware, costs nothing per request, and integrates directly with the download pipeline.
OpenAI Whisper was the obvious choice. It's open source, handles 90+ languages, and the accuracy on the large-v3 model is genuinely impressive — even with background noise and accented speech.
The Architecture
The stack is straightforward:
- Express 5 backend with Socket.IO for real-time progress updates
- yt-dlp handles video downloading from YouTube, TikTok, Instagram, etc.
- ffprobe extracts audio duration metadata
- Whisper CLI runs the actual transcription
The flow: the user submits a URL, the backend downloads the video with yt-dlp, and then, if the user wants a transcript, Whisper processes the audio file and returns timestamped segments.
URL → yt-dlp download → ffprobe (duration) → Whisper CLI → JSON segments
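The pipeline above can be sketched as one async function. The three helpers here are hypothetical stand-ins for the actual yt-dlp, ffprobe, and Whisper calls, stubbed so the shape of the flow is clear:

```javascript
// Illustrative sketch of the pipeline. These helpers are hypothetical
// placeholders for the real yt-dlp / ffprobe / Whisper CLI wrappers.
const downloadVideo = async (url) => `/tmp/${encodeURIComponent(url)}.mp4`;
const probeDuration = async (videoPath) => 180; // seconds, via ffprobe in reality
const runWhisper = async (videoPath, opts) => [
  { start: 0, end: 5, text: 'placeholder segment' },
];

async function processUrl(url, { transcribe = false, language = 'auto' } = {}) {
  const videoPath = await downloadVideo(url); // yt-dlp
  if (!transcribe) return { videoPath };
  const duration = await probeDuration(videoPath); // ffprobe
  const segments = await runWhisper(videoPath, { language, duration });
  return { videoPath, duration, segments };
}
```

The duration from ffprobe feeds the progress estimation described later, which is why it sits between download and transcription.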
Spawning Whisper from Node.js
Whisper runs as a Python CLI tool, so I spawn it as a child process:
```javascript
import { spawn } from 'child_process';

const args = [
  audioPath,
  '--output_dir', transcriptsDir,
  '--output_format', 'json',
  '--model', 'base',
  '--verbose', 'True',
];

if (language !== 'auto') {
  args.push('--language', language);
}

const whisper = spawn('/path/to/whisper', args);
```
Whisper writes a JSON file with this structure:
```json
{
  "text": "full transcript text here",
  "segments": [
    { "start": 0.0, "end": 5.04, "text": "First sentence here" },
    { "start": 5.04, "end": 10.2, "text": "Second sentence" }
  ],
  "language": "en"
}
```
Each segment includes start/end timestamps — useful for building subtitle files or jump-to-timestamp features.
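For example, those segments map almost directly onto the SubRip (.srt) subtitle format. A minimal converter, assuming the JSON shape shown above:

```javascript
// Convert a Whisper timestamp (float seconds) to SRT's HH:MM:SS,mmm form.
function toSrtTimestamp(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, '0');
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, '0');
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, '0');
  const frac = String(ms % 1000).padStart(3, '0');
  return `${h}:${m}:${s},${frac}`;
}

// Build an .srt file body from Whisper's segments array.
function segmentsToSrt(segments) {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}\n${seg.text.trim()}\n`
    )
    .join('\n');
}
```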
The Progress Problem
Whisper doesn't stream progress. It processes the whole file, then writes output. For a 3-minute video, that could mean 30+ seconds of dead silence where the user stares at a spinner.
My workaround: estimate completion time based on audio duration, then send simulated progress updates over Socket.IO.
```javascript
const estimatedTime = Math.max(duration * 2, 10);

let currentProgress = 0;
const progressInterval = setInterval(() => {
  if (currentProgress < 95) {
    currentProgress += Math.random() * 5 + 2;
    currentProgress = Math.min(currentProgress, 95);
    emitProgress({
      jobId,
      stage: 'transcribe',
      progress: Math.round(currentProgress),
      message: `Transcribing... ${Math.round(currentProgress)}%`,
    });
  }
}, estimatedTime * 10);
```
Is this elegant? No. Does it work? Users stop hitting refresh, so yes.
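The fake-progress step can be pulled out as a pure function, which also makes the cap explicit: advance by a random 2–7%, never past 95%. The last 5% is only reported once the Whisper process actually exits (at which point the interval should be cleared with `clearInterval`):

```javascript
// One tick of simulated progress: random 2–7% step, clamped at 95%.
// The `rand` parameter exists so the function is testable; in production
// it defaults to Math.random().
function nextProgress(current, rand = Math.random()) {
  return Math.min(current + rand * 5 + 2, 95);
}
```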
Model Selection Trade-offs
Whisper comes in several sizes. Here's what I found in practice:
| Model | Speed (3 min audio) | Accuracy | VRAM |
|---|---|---|---|
| tiny | ~5 sec | Decent for clear speech | ~1 GB |
| base | ~10 sec | Solid for most content | ~1 GB |
| small | ~30 sec | Noticeably better | ~2 GB |
| medium | ~90 sec | Great | ~5 GB |
| large-v3 | ~3 min | Best available | ~10 GB |
I default to base for the free tier. Fast enough that users don't abandon the page, accurate enough for most use cases. The tiny model occasionally garbles words, especially in non-English content.
For production I'd recommend base as default with an option to bump up to small or medium when accuracy matters more than speed.
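Those benchmarks can double as a progress estimator. The multipliers below (seconds of compute per second of audio) are back-derived from the 3-minute timings in the table, so treat them as ballpark figures for one machine rather than constants:

```javascript
// Rough per-model cost: seconds of processing per second of audio,
// back-of-the-envelope from the table above (my hardware, your mileage
// will vary).
const MODEL_COST = { tiny: 0.03, base: 0.06, small: 0.17, medium: 0.5, 'large-v3': 1.0 };

function estimateSeconds(model, audioSeconds) {
  const cost = MODEL_COST[model] ?? MODEL_COST.base; // unknown model: assume base
  return Math.max(Math.round(audioSeconds * cost), 10); // floor of 10s
}
```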
Handling Failures
Whisper can fail in ways that aren't immediately obvious:
Corrupted audio — yt-dlp sometimes produces files that ffmpeg can decode but Whisper chokes on. I added a pre-check using ffprobe to validate the audio stream before sending it to Whisper.
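The pre-check boils down to parsing `ffprobe -print_format json -show_streams` output and rejecting files with no decodable audio stream. A sketch of the validation side (the field names match ffprobe's JSON output; the duration check is my heuristic, not something Whisper requires):

```javascript
// Returns true if ffprobe's JSON output contains at least one audio
// stream with a positive duration; false for garbage or audio-less files.
function hasValidAudio(ffprobeJson) {
  let data;
  try {
    data = JSON.parse(ffprobeJson);
  } catch {
    return false; // ffprobe printed garbage or nothing
  }
  const streams = data.streams ?? [];
  return streams.some(
    (s) => s.codec_type === 'audio' && Number(s.duration) > 0
  );
}
```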
Memory limits — the large-v3 model needs ~10GB VRAM. On a machine with 8GB, it silently falls back to CPU and takes 10x longer. Set explicit timeouts or your users will be waiting forever.
Language detection hiccups — Whisper's auto-detect usually works, but it sometimes confuses similar languages (Ukrainian vs Russian, Spanish vs Portuguese). Letting users pick the language themselves fixes this.
My timeout sits at 30 minutes — generous, but some long videos with the medium model genuinely need that much time on CPU.
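Node makes the timeout itself easy: `spawn` accepts a `timeout` option (in milliseconds) and a `killSignal`, so a runaway Whisper process gets killed rather than hanging the job. A hedged sketch of a generic wrapper:

```javascript
import { spawn } from 'node:child_process';

const THIRTY_MINUTES = 30 * 60 * 1000;

// Run a command, rejecting if it exits non-zero or is killed by the
// timeout. spawn's built-in `timeout` option sends `killSignal` when
// the deadline passes.
function runWithTimeout(cmd, args, timeout = THIRTY_MINUTES) {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args, { timeout, killSignal: 'SIGKILL' });
    let stderr = '';
    child.stderr.on('data', (chunk) => (stderr += chunk));
    child.on('error', reject);
    child.on('close', (code, signal) => {
      if (signal) reject(new Error(`killed by ${signal} (timed out?)`));
      else if (code !== 0) reject(new Error(`exit ${code}: ${stderr}`));
      else resolve();
    });
  });
}
```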
What I'd Change Now
Use faster-whisper. The CTranslate2-based implementation is 4x faster with the same accuracy. I started with the vanilla OpenAI CLI for simplicity but would switch for any serious deployment.
Add a job queue. Right now transcription runs inline. Two concurrent jobs on the same GPU will both slow to a crawl. Bull or BullMQ with proper concurrency limits would fix this.
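Even before reaching for Bull, a tiny in-process limiter would stop two transcriptions from sharing one GPU. A minimal sketch (no persistence, no retries — a real queue adds both):

```javascript
// A minimal concurrency limiter: at most `concurrency` tasks run at
// once; the rest wait in FIFO order.
function createLimiter(concurrency = 1) {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active >= concurrency || waiting.length === 0) return;
    active++;
    const { task, resolve, reject } = waiting.shift();
    task()
      .then(resolve, reject)
      .finally(() => {
        active--;
        next();
      });
  };
  return (task) =>
    new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
}
```

Wrapping each Whisper invocation in `limit(() => runWhisper(...))` serializes GPU work without touching the rest of the pipeline.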
Cache transcripts. Same video URL should return cached results instead of re-processing. Storage is cheap, CPU time isn't.
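A cache key only needs the normalized URL plus whatever affects the output (model, language). One way to build it, assuming a hash of those three is enough to identify a transcript:

```javascript
import { createHash } from 'node:crypto';

// Derive a stable cache key from the normalized URL plus the settings
// that change the transcript. Same inputs, same key, same cached file.
function transcriptCacheKey(url, model = 'base', language = 'auto') {
  const normalized = new URL(url).toString();
  return createHash('sha256')
    .update(`${normalized}|${model}|${language}`)
    .digest('hex');
}
```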
Try It Yourself
If you want to see this in action, Videolyti lets you download a video and transcribe it in one go. Paste a YouTube or TikTok link, download, hit transcribe, get timestamped text back.
No signup, no payment. Just works.
Building something similar? I'm happy to talk through the Socket.IO progress tracking or error handling in more detail — drop a comment or reach out.