DEV Community

zephyr zheng

Posted on • Originally published at telegra.ph

Subtitles From a YouTube Link Without Leaving the Browser

Last week I needed captions for a 14-minute conference talk to drop into a changelog entry. Three years ago I'd have reached for a shell: yt-dlp -x --audio-format mp3 <url>, then whisper input.mp3 --model small --output_format srt, then ffmpeg to sanity-check the audio if Whisper got confused by a music intro. Python env, ~2GB of model weights on disk, and a terminal window open for the whole thing. I just don't bother with any of that anymore.

My actual workflow now is two browser tabs. I paste the YouTube URL into a browser-based MP3 downloader, get the audio file, drop it into a browser-based transcriber, and export SRT. Whisper-tiny runs in quantized ONNX form at roughly 40MB, pulled once and cached in IndexedDB, so the second run starts instantly. No pip install, no brew install ffmpeg, no figuring out why CoreML is sulking at me today.
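To make the export step concrete, here's a minimal sketch (mine, not lifted from any particular tool) of how per-segment timestamps become SRT cues. The helper names formatSrtTimestamp and srtCue are hypothetical; any transcriber's exporter does something equivalent:

```javascript
// Format a float seconds offset into the HH:MM:SS,mmm form SRT cues use.
function formatSrtTimestamp(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  const frac = ms % 1000;
  const pad = (n, w) => String(n).padStart(w, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(frac, 3)}`;
}

// One numbered cue, as it appears in the exported .srt file:
// index, blank-separated timing line, then the caption text.
function srtCue(index, start, end, text) {
  return `${index}\n${formatSrtTimestamp(start)} --> ${formatSrtTimestamp(end)}\n${text}\n`;
}
```

The comma before the milliseconds (rather than a period) is the one SRT quirk worth remembering; VTT uses a period, which is most of what "VTT rather than SRT" changes.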

What changed underneath

The shift isn't about speed. Local Whisper on an M2 still beats the browser: distil-large-v3 is 6.3× faster than large-v3 at roughly 49% of the parameters and stays within 1% WER on long-form audio (Gandhi et al. 2023, HF model card), but that's running natively, not in a WebAssembly sandbox. What changed is that the extraction step and the inference step finally live in the same runtime. yt-dlp is still the most complete YouTube extractor on the planet: a youtube-dl fork, a Python CLI with thousands of site extractors, and the tool I'd still reach for if I were batching fifty videos overnight. But for one video, shuffling a file between ~/Downloads and a model and a subtitle tool is three context switches I now skip.

The browser side got there via Transformers.js v3, which ships first-class WebGPU support through ONNX Runtime Web: pass device: 'webgpu' and you're off WASM. Audio extraction piggybacks on MediaRecorder and WebCodecs, both of which are now stable enough that a page can pull audio out of a video stream without a server round-trip. Put those together and the "three tools plus a Python env" stack collapses into a tab.
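A minimal sketch of what that looks like with the Transformers.js v3 pipeline API. The model repo (onnx-community/whisper-tiny.en) and the q8 quantization level are assumptions on my part; check the model card for what's actually published:

```javascript
// Browser-side Whisper via Transformers.js v3.
// device: 'webgpu' selects the ONNX Runtime Web WebGPU backend;
// leaving it out falls back to WASM.
const asrOptions = {
  device: 'webgpu',
  dtype: 'q8', // quantized weights — roughly the 40MB whisper-tiny download
};

async function transcribe(audioUrl) {
  const { pipeline } = await import('@huggingface/transformers');
  const transcriber = await pipeline(
    'automatic-speech-recognition',
    'onnx-community/whisper-tiny.en', // assumed model id
    asrOptions,
  );
  // return_timestamps yields per-segment times, which map directly onto SRT cues
  return transcriber(audioUrl, { return_timestamps: true });
}
```

The first call downloads and caches the weights (IndexedDB in the browser), which is why only the first run pays the model-fetch cost.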

When I still open the terminal

I haven't deleted yt-dlp. For long videos (past about an hour the browser tab starts feeling it: memory pressure, background-tab throttling), for batches (anything scripted), and for paranoid-accuracy work where I want large-v3 with word-level timestamps and VTT rather than SRT, local is still the right answer. If I'm captioning a podcast feed on a cron, that's a yt-dlp + Whisper pipeline and probably always will be. There's also the lossless WAV variant for cases where the MP3 re-encode actually matters to WER — usually it doesn't, but for thick accents or noisy recordings I've seen WAV input shave a few errors per minute.

So: the browser flow wins on ad-hoc work, privacy (nothing leaves the machine either way, but there's no local state to clean up), and the zero-setup case when I'm on a borrowed laptop. The CLI wins on volume, on long-tail model options, and on anything I want to script. These days the terminal sits idle most weeks for this kind of task, which still surprises me a little.
