When I started building TubeVocab, I had a chicken-and-egg problem. I needed a corpus of YouTube subtitles to mine ESL vocabulary from — but the official YouTube Data API v3 doesn't return subtitle bodies unless you own the channel. The captions.download endpoint? Auth-locked to channel owners.
So I had to find another way. Two weeks, 50,247 videos, $7.12 in egress costs, and one mild panic about ToS later, here's what actually worked.
## The undocumented endpoint nobody talks about
Every YouTube watch page hits this URL pattern internally to fetch caption tracks:
```
https://www.youtube.com/api/timedtext?lang=en&v=VIDEO_ID&fmt=json3
```
It returns JSON. No auth. No quota. No API key. It's the same endpoint the YouTube web player uses to render captions on your screen. As far as I can tell, it's been there since ~2014 and Google hasn't deprecated it because their own player depends on it.
The catch: you need the right `lang` code, and for auto-generated captions (which account for roughly 80% of educational content) you need an extra `&kind=asr` param. To get the list of available tracks, you first hit:
```
https://www.youtube.com/api/timedtext?type=list&v=VIDEO_ID
```
That one returns XML (yes, mixed-format APIs; very 2014). I parse the `<track>` nodes, prefer a manual `en` track over `en (auto)`, then fetch the json3.
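Stitched together, the list-then-fetch dance looks roughly like this (a sketch, not the production code; the `lang_code` and `kind` attribute names come from the legacy XML format and are worth verifying against a live response):

```python
import requests
import xml.etree.ElementTree as ET

TIMEDTEXT = "https://www.youtube.com/api/timedtext"

def list_tracks(video_id: str) -> list[dict]:
    # type=list returns XML; each <track> node describes one caption track.
    resp = requests.get(TIMEDTEXT, params={"type": "list", "v": video_id}, timeout=10)
    resp.raise_for_status()
    if not resp.text.strip():
        return []
    return [dict(t.attrib) for t in ET.fromstring(resp.text).iter("track")]

def direct_timedtext(video_id: str) -> dict | None:
    tracks = [t for t in list_tracks(video_id) if t.get("lang_code", "").startswith("en")]
    if not tracks:
        return None
    # Manual English tracks sort ahead of auto-generated (kind="asr") ones.
    tracks.sort(key=lambda t: t.get("kind") == "asr")
    params = {"lang": tracks[0]["lang_code"], "v": video_id, "fmt": "json3"}
    if tracks[0].get("kind") == "asr":
        params["kind"] = "asr"
    resp = requests.get(TIMEDTEXT, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json() if resp.text.strip() else None
```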
## When timedtext fails, yt-dlp picks up the slack
About 4% of videos return empty timedtext responses even though the player UI shows captions. I never figured out exactly why — maybe regional caption availability, maybe age-gated content, maybe a stale cache somewhere on YouTube's edge.
The fallback was `yt-dlp --skip-download --write-auto-subs --sub-format json3 --sub-langs en`. It's slower (it has to resolve the player JS), but it works on the long tail. I shell out to it from Python only when the direct endpoint returns nothing usable:
```python
def fetch_subs(video_id: str) -> dict | None:
    # Try the direct timedtext endpoint first; only shell out to yt-dlp
    # when the payload is empty or missing its "events" list.
    payload = direct_timedtext(video_id)
    if payload and payload.get("events"):
        return payload
    return ytdlp_fallback(video_id)  # ~3 s vs ~200 ms for the direct call
```
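The fallback itself is a thin subprocess wrapper around yt-dlp. A sketch (the flags are the ones above; the temp-dir and `-o` output-template handling here is illustrative):

```python
import json
import subprocess
import tempfile
from pathlib import Path

def ytdlp_fallback(video_id: str) -> dict | None:
    # Let yt-dlp resolve the player JS, write the json3 subs to a temp dir,
    # then read back whichever subtitle file it produced.
    with tempfile.TemporaryDirectory() as tmp:
        cmd = [
            "yt-dlp", "--skip-download", "--write-auto-subs",
            "--sub-format", "json3", "--sub-langs", "en",
            "-o", f"{tmp}/%(id)s",
            f"https://www.youtube.com/watch?v={video_id}",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return None
        sub_files = sorted(Path(tmp).glob("*.json3"))
        if not sub_files:
            return None
        return json.loads(sub_files[0].read_text(encoding="utf-8"))
```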
## Batch processing: the part that actually saved money
Naive scraping was 1 video per request from my laptop. After ~500 videos I noticed YouTube started 429-ing me from a single IP. So I rebuilt the pipeline with three constraints:
- One Cloud Run job per ~5k video batch. Cloud Run gives me a fresh egress IP per cold start. 10 cold starts per night = 10 different IPs.
- 3 concurrent workers per job, with 250 ms of jitter between requests. Below about 12 req/s per IP, nothing throttled.
- Subtitles → newline-delimited JSON → gzip → GCS. Storing each video as a separate file killed me on small-object overhead; batching 5k videos into one `.ndjson.gz` (~38 MB) brought storage cost from $0.42 to $0.008 per 1k videos. (There's a rough sketch of one batch job after this list.)
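Put together, one nightly batch job doesn't need to be much more than this (`fetch_subs` is the function from earlier; the bucket name and file layout are placeholders):

```python
import gzip
import json
import random
import time
from concurrent.futures import ThreadPoolExecutor

from google.cloud import storage

BUCKET = "tubevocab-subs"  # placeholder bucket name
JITTER_S = 0.25            # 250 ms of jitter between requests

def fetch_with_jitter(video_id: str) -> dict | None:
    time.sleep(random.uniform(0, JITTER_S))
    return fetch_subs(video_id)  # from the snippet above

def run_batch(video_ids: list[str], batch_name: str) -> None:
    # 3 concurrent workers per job, then one gzipped NDJSON object per batch.
    with ThreadPoolExecutor(max_workers=3) as pool:
        results = list(pool.map(fetch_with_jitter, video_ids))

    local_path = f"/tmp/{batch_name}.ndjson.gz"
    with gzip.open(local_path, "wt", encoding="utf-8") as fh:
        for vid, payload in zip(video_ids, results):
            if payload:
                fh.write(json.dumps({"video_id": vid, "subs": payload}) + "\n")

    storage.Client().bucket(BUCKET).blob(f"raw/{batch_name}.ndjson.gz") \
        .upload_from_filename(local_path)
```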
Total scraping cost over 2 weeks:
| Line item | Cost |
|---|---|
| Cloud Run compute (10 jobs × 14 nights × ~6 min each) | $4.31 |
| GCS Standard storage (~12 GB) | $0.24 |
| GCS egress to my dev box for sampling | $1.07 |
| Misc (BigQuery loads, Cloud Logging) | $1.50 |
| Total | $7.12 |
## The legal gray zone (I am not a lawyer)
YouTube's ToS section 5.B prohibits "accessing the Service using any automated means... other than through the YouTube API." Strict reading: my timedtext scraping violates ToS.
But — and this is where I made a judgment call — I'm not redistributing the subtitle text. I'm extracting vocabulary frequencies, lemmas, and CEFR difficulty bands from them, then storing only metadata (word + video_id + timestamp) in my user-facing DB. The raw subtitle blobs sit in cold GCS and never leave my pipeline.
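Concretely, the extraction step reduces each json3 payload to rows of that shape. A simplified sketch (lemmatization and CEFR banding omitted; the `tStartMs` and `segs[].utf8` field names are what json3 payloads carry in my experience, so verify against your own data):

```python
def extract_mentions(video_id: str, payload: dict) -> list[dict]:
    rows = []
    for event in payload.get("events", []):
        text = "".join(seg.get("utf8", "") for seg in event.get("segs", []))
        for word in text.lower().split():
            rows.append({
                "word": word.strip(".,!?\"'"),   # crude token cleanup
                "video_id": video_id,
                "timestamp_ms": event.get("tStartMs", 0),
            })
    return rows
```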
I also exclude any video where I detect a copyright claim or strike mentioned in the description, and I respect the channel's `<meta name="robots">` tag even though there's no legal requirement to. It's a vibes-based defense, but if a takedown email ever arrives, my response is "deleted within the hour."
Two months in, no email yet.
## What I'd do differently
- Start with yt-dlp first, profile it, then optimize the hot path with direct timedtext. I burned 3 days writing direct-endpoint code before realizing the fallback covered 96% of cases anyway and was simpler to maintain.
- Don't store raw subtitles in your prod DB. Process → extract → discard. SQLite was 11GB before I noticed.
- Keep a videos-attempted table separate from videos-succeeded (minimal sketch after this list). I lost count of how many times I re-scraped failures because I couldn't tell what I'd already tried.
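A minimal version of that bookkeeping split (schema and column names are illustrative):

```python
import sqlite3

def init_tracking(db_path: str = "scrape_tracking.db") -> sqlite3.Connection:
    # Every attempt lands in videos_attempted; only confirmed pulls graduate
    # to videos_succeeded, so the retry set is a simple difference of the two.
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS videos_attempted (
            video_id     TEXT PRIMARY KEY,
            attempted_at TEXT NOT NULL,
            error        TEXT
        );
        CREATE TABLE IF NOT EXISTS videos_succeeded (
            video_id   TEXT PRIMARY KEY,
            gcs_batch  TEXT NOT NULL,
            scraped_at TEXT NOT NULL
        );
    """)
    return conn
```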
The pipeline now runs unattended and pulls in ~2k new videos per night across 40 ESL-relevant channels. Total marginal cost per video: $0.00014. Total time I spend maintaining it: ~10 minutes a week.
If you're curious how this corpus turns into an actual learning product, that's TubeVocab — same scraper, plus a frontend that ranks vocab by CEFR level and lets you click through to the exact second a word was spoken.