I built a video transcription feature into my side project — a free video downloader called Videolyti. Here's how I wired up OpenAI's Whisper model to transcribe downloaded videos on the server side, what worked, what didn't, and what I'd do differently.
Why Server-Side Transcription?
Most transcription tools either charge per minute of audio or require you to upload files to some third-party API. I wanted something that runs on my own hardware, costs nothing per request, and integrates directly with the download pipeline.
OpenAI Whisper was the obvious choice. It's open source, handles 90+ languages, and the accuracy on the large-v3 model is genuinely impressive — even with background noise and accented speech.
The Architecture
The stack is straightforward:
- Express 5 backend with Socket.IO for real-time progress updates
- yt-dlp handles video downloading from YouTube, TikTok, Instagram, etc.
- ffprobe extracts audio duration metadata
- Whisper CLI runs the actual transcription
The flow: the user submits a URL, the backend downloads the video with yt-dlp, and then, if the user wants a transcript, Whisper processes the audio file and returns timestamped segments.
URL → yt-dlp download → ffprobe (duration) → Whisper CLI → JSON segments
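The pipeline above can be sketched as one async function. The three helpers here are hypothetical stand-ins for the actual yt-dlp, ffprobe, and Whisper calls, stubbed so the shape of the flow is clear:

```javascript
// Illustrative sketch of the pipeline. These helpers are hypothetical
// placeholders for the real yt-dlp / ffprobe / Whisper CLI wrappers.
const downloadVideo = async (url) => `/tmp/${encodeURIComponent(url)}.mp4`;
const probeDuration = async (videoPath) => 180; // seconds, via ffprobe in reality
const runWhisper = async (videoPath, opts) => [
  { start: 0, end: 5, text: 'placeholder segment' },
];

async function processUrl(url, { transcribe = false, language = 'auto' } = {}) {
  const videoPath = await downloadVideo(url); // yt-dlp
  if (!transcribe) return { videoPath };
  const duration = await probeDuration(videoPath); // ffprobe
  const segments = await runWhisper(videoPath, { language, duration });
  return { videoPath, duration, segments };
}
```

The duration from ffprobe feeds the progress estimation described later, which is why it sits between download and transcription.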
Spawning Whisper from Node.js
Whisper runs as a Python CLI tool, so I spawn it as a child process:
```javascript
import { spawn } from 'child_process';

const args = [
  audioPath,
  '--output_dir', transcriptsDir,
  '--output_format', 'json',
  '--model', 'base',
  '--verbose', 'True',
];

if (language !== 'auto') {
  args.push('--language', language);
}

const whisper = spawn('/path/to/whisper', args);
```
Whisper writes a JSON file with this structure:
```json
{
  "text": "full transcript text here",
  "segments": [
    { "start": 0.0, "end": 5.04, "text": "First sentence here" },
    { "start": 5.04, "end": 10.2, "text": "Second sentence" }
  ],
  "language": "en"
}
```
Each segment includes start/end timestamps — useful for building subtitle files or jump-to-timestamp features.
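For example, those segments map almost directly onto the SubRip (.srt) subtitle format. A minimal converter, assuming the JSON shape shown above:

```javascript
// Convert a Whisper timestamp (float seconds) to SRT's HH:MM:SS,mmm form.
function toSrtTimestamp(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, '0');
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, '0');
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, '0');
  const frac = String(ms % 1000).padStart(3, '0');
  return `${h}:${m}:${s},${frac}`;
}

// Build an .srt file body from Whisper's segments array.
function segmentsToSrt(segments) {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}\n${seg.text.trim()}\n`
    )
    .join('\n');
}
```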
The Progress Problem
Whisper doesn't stream progress. It processes the whole file, then writes output. For a 3-minute video, that could mean 30+ seconds of dead silence where the user stares at a spinner.
My workaround: estimate completion time based on audio duration, then send simulated progress updates over Socket.IO.
```javascript
const estimatedTime = Math.max(duration * 2, 10);

let currentProgress = 0;
const progressInterval = setInterval(() => {
  if (currentProgress < 95) {
    currentProgress += Math.random() * 5 + 2;
    currentProgress = Math.min(currentProgress, 95);
    emitProgress({
      jobId,
      stage: 'transcribe',
      progress: Math.round(currentProgress),
      message: `Transcribing... ${Math.round(currentProgress)}%`,
    });
  }
}, estimatedTime * 10);
```
Is this elegant? No. Does it work? Users stop hitting refresh, so yes.
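The fake-progress step can be pulled out as a pure function, which also makes the cap explicit: advance by a random 2–7%, never past 95%. The last 5% is only reported once the Whisper process actually exits (at which point the interval should be cleared with `clearInterval`):

```javascript
// One tick of simulated progress: random 2–7% step, clamped at 95%.
// The `rand` parameter exists so the function is testable; in production
// it defaults to Math.random().
function nextProgress(current, rand = Math.random()) {
  return Math.min(current + rand * 5 + 2, 95);
}
```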
Model Selection Trade-offs
Whisper comes in several sizes. Here's what I found in practice:
| Model | Speed (3 min audio) | Accuracy | VRAM |
|---|---|---|---|
| tiny | ~5 sec | Decent for clear speech | ~1 GB |
| base | ~10 sec | Solid for most content | ~1 GB |
| small | ~30 sec | Noticeably better | ~2 GB |
| medium | ~90 sec | Great | ~5 GB |
| large-v3 | ~3 min | Best available | ~10 GB |
I default to base for the free tier. Fast enough that users don't abandon the page, accurate enough for most use cases. The tiny model occasionally garbles words, especially in non-English content.
For production I'd recommend base as default with an option to bump up to small or medium when accuracy matters more than speed.
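Those benchmarks can double as a progress estimator. The multipliers below (seconds of compute per second of audio) are back-derived from the 3-minute timings in the table, so treat them as ballpark figures for one machine rather than constants:

```javascript
// Rough per-model cost: seconds of processing per second of audio,
// back-of-the-envelope from the table above (my hardware, your mileage
// will vary).
const MODEL_COST = { tiny: 0.03, base: 0.06, small: 0.17, medium: 0.5, 'large-v3': 1.0 };

function estimateSeconds(model, audioSeconds) {
  const cost = MODEL_COST[model] ?? MODEL_COST.base; // unknown model: assume base
  return Math.max(Math.round(audioSeconds * cost), 10); // floor of 10s
}
```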
Handling Failures
Whisper can fail in ways that aren't immediately obvious:
Corrupted audio — yt-dlp sometimes produces files that ffmpeg can decode but Whisper chokes on. I added a pre-check using ffprobe to validate the audio stream before sending it to Whisper.
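The pre-check boils down to parsing `ffprobe -print_format json -show_streams` output and rejecting files with no decodable audio stream. A sketch of the validation side (the field names match ffprobe's JSON output; the duration check is my heuristic, not something Whisper requires):

```javascript
// Returns true if ffprobe's JSON output contains at least one audio
// stream with a positive duration; false for garbage or audio-less files.
function hasValidAudio(ffprobeJson) {
  let data;
  try {
    data = JSON.parse(ffprobeJson);
  } catch {
    return false; // ffprobe printed garbage or nothing
  }
  const streams = data.streams ?? [];
  return streams.some(
    (s) => s.codec_type === 'audio' && Number(s.duration) > 0
  );
}
```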
Memory limits — the large-v3 model needs ~10GB VRAM. On a machine with 8GB, it silently falls back to CPU and takes 10x longer. Set explicit timeouts or your users will be waiting forever.
Language detection hiccups — Whisper's auto-detect usually works, but it sometimes confuses similar languages (Ukrainian vs Russian, Spanish vs Portuguese). Letting users pick the language themselves fixes this.
My timeout sits at 30 minutes — generous, but some long videos with the medium model genuinely need that much time on CPU.
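Node makes the timeout itself easy: `spawn` accepts a `timeout` option (in milliseconds) and a `killSignal`, so a runaway Whisper process gets killed rather than hanging the job. A hedged sketch of a generic wrapper:

```javascript
import { spawn } from 'node:child_process';

const THIRTY_MINUTES = 30 * 60 * 1000;

// Run a command, rejecting if it exits non-zero or is killed by the
// timeout. spawn's built-in `timeout` option sends `killSignal` when
// the deadline passes.
function runWithTimeout(cmd, args, timeout = THIRTY_MINUTES) {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args, { timeout, killSignal: 'SIGKILL' });
    let stderr = '';
    child.stderr.on('data', (chunk) => (stderr += chunk));
    child.on('error', reject);
    child.on('close', (code, signal) => {
      if (signal) reject(new Error(`killed by ${signal} (timed out?)`));
      else if (code !== 0) reject(new Error(`exit ${code}: ${stderr}`));
      else resolve();
    });
  });
}
```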
What I'd Change Now
Use faster-whisper. The CTranslate2-based implementation is 4x faster with the same accuracy. I started with the vanilla OpenAI CLI for simplicity but would switch for any serious deployment.
Add a job queue. Right now transcription runs inline. Two concurrent jobs on the same GPU will both slow to a crawl. Bull or BullMQ with proper concurrency limits would fix this.
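Even before reaching for Bull, a tiny in-process limiter would stop two transcriptions from sharing one GPU. A minimal sketch (no persistence, no retries — a real queue adds both):

```javascript
// A minimal concurrency limiter: at most `concurrency` tasks run at
// once; the rest wait in FIFO order.
function createLimiter(concurrency = 1) {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active >= concurrency || waiting.length === 0) return;
    active++;
    const { task, resolve, reject } = waiting.shift();
    task()
      .then(resolve, reject)
      .finally(() => {
        active--;
        next();
      });
  };
  return (task) =>
    new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
}
```

Wrapping each Whisper invocation in `limit(() => runWhisper(...))` serializes GPU work without touching the rest of the pipeline.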
Cache transcripts. Same video URL should return cached results instead of re-processing. Storage is cheap, CPU time isn't.
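A cache key only needs the normalized URL plus whatever affects the output (model, language). One way to build it, assuming a hash of those three is enough to identify a transcript:

```javascript
import { createHash } from 'node:crypto';

// Derive a stable cache key from the normalized URL plus the settings
// that change the transcript. Same inputs, same key, same cached file.
function transcriptCacheKey(url, model = 'base', language = 'auto') {
  const normalized = new URL(url).toString();
  return createHash('sha256')
    .update(`${normalized}|${model}|${language}`)
    .digest('hex');
}
```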
Try It Yourself
If you want to see this in action, Videolyti lets you download a video and transcribe it in one go. Paste a YouTube or TikTok link, download, hit transcribe, get timestamped text back.
No signup, no payment. Just works.
Building something similar? I'm happy to talk through the Socket.IO progress tracking or error handling in more detail — drop a comment or reach out.