How to Transcribe a YouTube Video (Free, in Under a Minute)

#javascript #webdev #ai #cloudflare

Building a "paste a YouTube link, get a transcript" feature sounds trivial until you deploy it to a server. The moment your request comes from a datacenter IP instead of a residential one, YouTube responds with LOGIN_REQUIRED or quietly serves nothing. Here's how VidTranscriber handles it.

The problem

There are two ways to get text from a YouTube video:

Existing captions: if the uploader (or YouTube's auto-captioning) provides them, you can fetch the caption track directly. Fast, free, no transcription needed.
Transcribe the audio: pull the audio stream and run it through a speech-to-text model (for example, one from the Whisper family). Works for any video, but costs compute.

Both start with talking to YouTube from your server — and that's where it breaks. YouTube aggressively gates datacenter traffic: the watch page and InnerTube API return LOGIN_REQUIRED, and naive audio fetching gets reCAPTCHA'd.
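
To make the failure mode concrete, here is a minimal sketch that inspects a player-response-shaped object and reports whether the request was gated and which caption tracks are on offer. The field names (`playabilityStatus.status`, `captions.playerCaptionsTracklistRenderer.captionTracks`, `kind: "asr"` for auto-generated tracks) reflect how InnerTube player responses have commonly looked, but that API is undocumented and can change, so treat them as assumptions. `inspectPlayerResponse` is a hypothetical helper for illustration, not VidTranscriber's actual code.

```javascript
// Hypothetical helper: classify an InnerTube-style player response.
// Field names are assumptions about an undocumented response shape.
function inspectPlayerResponse(playerResponse) {
  const status = playerResponse?.playabilityStatus?.status ?? "UNKNOWN";

  // Datacenter IPs are often answered with LOGIN_REQUIRED instead of OK.
  if (status !== "OK") {
    return { blocked: true, status, tracks: [] };
  }

  const rawTracks =
    playerResponse?.captions?.playerCaptionsTracklistRenderer?.captionTracks ?? [];

  // Keep only what the captions path needs.
  const tracks = rawTracks.map((t) => ({
    url: t.baseUrl,
    lang: t.languageCode,
    autoGenerated: t.kind === "asr",
  }));

  return { blocked: false, status, tracks };
}

// Example with hand-written sample data (not real YouTube output).
const gated = inspectPlayerResponse({
  playabilityStatus: { status: "LOGIN_REQUIRED" },
});
console.log(gated.blocked); // true
```

A response that comes back as "OK" with an empty track list is the case where only the audio path remains.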

The approach

The fix is to separate where the request originates from where the work happens:

A Cloudflare Worker handles the user request and orchestration.
Caption/audio fetching is routed through a path whose egress IP isn't flagged as bot traffic, so the LOGIN_REQUIRED wall doesn't trigger.
Captions, when available, become the primary path (no transcription cost). Only when there are no usable captions do we fall back to downloading audio and running Whisper.
Long jobs go onto a queue (Cloudflare Queues) so the request returns immediately and the transcript streams in as it completes.
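
The captions-first flow with a queued fallback can be sketched roughly as follows. Everything here is illustrative: `handleTranscriptRequest`, `fetchCaptions`, `newJobId`, and the message shape are hypothetical names, and the dependencies are injected so the logic can run outside a Worker. The only platform detail assumed is that a Cloudflare Queues producer binding exposes an async `send(message)` method.

```javascript
// Sketch of the captions-first flow. All names are hypothetical.
// deps.fetchCaptions(videoId) -> Promise<string | null>
// deps.queue.send(message)    -> Promise<void> (shape of a Queues producer binding)
// deps.newJobId()             -> string
async function handleTranscriptRequest(videoId, deps) {
  // Primary path: reuse existing captions, no transcription cost.
  const captions = await deps.fetchCaptions(videoId);
  if (captions && captions.trim().length > 0) {
    return { status: "done", source: "captions", transcript: captions };
  }

  // Fallback: enqueue an audio + Whisper job and return right away.
  const jobId = deps.newJobId();
  await deps.queue.send({ jobId, videoId, kind: "whisper" });
  return { status: "queued", source: "whisper", jobId };
}

// Usage with in-memory stand-ins for the real dependencies.
const demoDeps = {
  fetchCaptions: async (id) => (id === "has-captions" ? "hello world" : null),
  queue: { send: async (message) => {} },
  newJobId: () => "job-1",
};
handleTranscriptRequest("no-captions", demoDeps).then((result) => {
  console.log(result.status); // "queued"
});
```

In a Worker, `deps.queue` could be a producer binding such as `env.TRANSCRIBE_QUEUE` (a made-up binding name), with a separate queue consumer doing the audio download and transcription. The client then polls or subscribes using the returned `jobId`.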

Why captions-first matters

Most "transcript generator" traffic is for videos that already have captions — talks, tutorials, news. Serving those from the caption track is instant and free, which means the expensive Whisper path is reserved for the minority of videos that actually need it. That's the difference between a tool that's cheap to run and one that isn't.

What's still hard

IP reputation drifts — what works today can get throttled tomorrow, so the extraction path needs monitoring and fallbacks.
Caption quality varies — auto-generated captions lack punctuation and speaker labels, so for quality output we sometimes re-transcribe even when captions exist.
Very long videos need chunking to stay within memory and timeout budgets.
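
Two of these points lend themselves to small helpers. The sketch below shows one possible heuristic for spotting unpunctuated auto-captions and one way to split a long video into overlapping time windows. Both functions, their names, and their defaults (one punctuation mark per 200 characters, 600-second chunks, 5-second overlap) are assumptions chosen for illustration, not tuned values from a real deployment.

```javascript
// Heuristic: auto-generated captions usually contain almost no sentence
// punctuation. The threshold is an arbitrary illustrative value.
function looksUnpunctuated(text, minMarksPer200Chars = 1) {
  const marks = (text.match(/[.!?]/g) || []).length;
  const expected = (text.length / 200) * minMarksPer200Chars;
  // Very short texts are not judged at all.
  return text.length >= 200 && marks < expected;
}

// Split [0, durationSec] into windows of at most chunkSec seconds that
// overlap by overlapSec, so words cut at a boundary appear in both chunks.
function planChunks(durationSec, chunkSec = 600, overlapSec = 5) {
  if (!(durationSec > 0)) return [];
  if (chunkSec <= overlapSec) {
    throw new RangeError("chunkSec must be greater than overlapSec");
  }
  const chunks = [];
  const step = chunkSec - overlapSec;
  for (let start = 0; start < durationSec; start += step) {
    const end = Math.min(start + chunkSec, durationSec);
    chunks.push({ start, end });
    if (end === durationSec) break; // avoid a trailing sliver chunk
  }
  return chunks;
}

console.log(planChunks(1500).length); // 3
```

Each window can then be transcribed as its own queue job, with the overlapping seconds deduplicated when the pieces are stitched back together.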