Devil Scrapes

Posted on Jun 4

YouTube Transcript Scraper: bulk-download captions for RAG, AI, and show notes

#webscraping #python #ai #apify

Quick answer: YouTube exposes closed captions through a timedtext endpoint, but that endpoint is undocumented, parameter-shifting, and rate-limits aggressively on bulk requests. A YouTube transcript scraper fetches the caption track for each video ID, picks the best language-matched track (manual preferred over auto-generated), and returns both a plain-text transcript and per-cue timed segments as structured JSON. The Apify Actor at apify.com/DevilScrapes/youtube-transcript-scraper does it for $0.004 per transcript (~$4.00 per 1,000 videos), with proxy rotation, retries, and multilingual support handled for you.

You're building a RAG system over a curated YouTube playlist. Or you're a podcast producer who wants automatic show notes every week. Or you're doing discourse analysis on three years of conference talks. The first step is the same: get the transcripts out of YouTube as structured text.

That sounds easy. It isn't. The timedtext endpoint behind YouTube's caption UI is undocumented, returns XML in a shape that has quietly changed multiple times, and does not tolerate bulk fetching from datacenter IPs. Every popular library here — youtube-transcript-api included — works for one-off scripts and breaks silently on scheduled, high-volume runs. Here's what's actually happening, and how I packaged the fix.

What is YouTube's transcript system? 🎬

YouTube generates or ingests captions for most public videos and serves them through an internal timedtext endpoint. There are two types:

Manual transcripts — uploaded or corrected by the video owner. Usually more accurate, sometimes unavailable on older videos.
Auto-generated (ASR) transcripts — produced by Google's speech recognition. Present on the vast majority of English-language videos uploaded since ~2012, and increasingly available in non-English languages.

Both types are public metadata — they're what populates the subtitle UI every viewer sees. They are not the video itself: scraping transcripts is in the same category as scraping a web page's text, not downloading a video file. The Actor never fetches video content and does not work on private or unlisted videos.

Does YouTube have a transcript API? 🔎

No. YouTube's official Data API v3 does not expose transcript content for arbitrary videos. It lets you fetch video metadata, comments, and caption-track metadata (a list of available tracks), but it returns caption content only for videos on a channel you own: the captions.download endpoint requires OAuth against the channel's own account.

The only route to public transcripts is the timedtext endpoint the website uses. That endpoint is undocumented, inspects your request shape, and is the reason a hardened Actor exists instead of a three-line snippet.

What the data looks like 📄

Each video comes back as one flat, typed row. Every field comes from src/models.py:

{
  "video_id": "dQw4w9WgXcQ",
  "video_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  "title": "Rick Astley - Never Gonna Give You Up (Official Music Video)",
  "channel_name": "Rick Astley",
  "channel_id": "UCuAXFkgsw1L7xaCfnd5JJOw",
  "duration_seconds": 212,
  "language": "en",
  "is_auto_generated": false,
  "transcript_text": "We're no strangers to love\nYou know the rules and so do I\n...",
  "segments": [
    { "text": "We're no strangers to love", "start": 18.5, "duration": 3.2 },
    { "text": "You know the rules and so do I", "start": 22.8, "duration": 3.5 }
  ],
  "available_languages": ["en", "es", "fr", "de"],
  "scraped_at": "2026-05-31T09:00:00+00:00"
}

Twelve fields, Pydantic-validated before they hit your dataset. The transcript_text drops straight into a vector store; the segments array is what you feed a sentence-splitter when chunk boundaries matter.
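If you want the same typed shape on the consuming side without pulling in Pydantic, a stdlib stand-in is enough for a quick sanity check. This is a sketch only: the `Segment`, `TranscriptRow`, and `parse_row` names are mine, not the contents of src/models.py, and the field types are inferred from the sample row above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    text: str
    start: float      # seconds from the start of the video
    duration: float   # seconds the cue stays on screen

    @property
    def end(self) -> float:
        return self.start + self.duration


@dataclass(frozen=True)
class TranscriptRow:
    video_id: str
    video_url: str
    title: str
    channel_name: str
    channel_id: str
    duration_seconds: int
    language: str
    is_auto_generated: bool
    transcript_text: str
    available_languages: list[str]
    scraped_at: str
    segments: list[Segment] = field(default_factory=list)


def parse_row(raw: dict) -> TranscriptRow:
    """Build a typed row from one dataset item.

    Raises TypeError if a field is missing or an unknown field shows up.
    """
    data = dict(raw)  # copy so the caller's dict is left untouched
    segments = [Segment(**cue) for cue in data.pop("segments", [])]
    return TranscriptRow(segments=segments, **data)


sample = {
    "video_id": "dQw4w9WgXcQ",
    "video_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "title": "Rick Astley - Never Gonna Give You Up (Official Music Video)",
    "channel_name": "Rick Astley",
    "channel_id": "UCuAXFkgsw1L7xaCfnd5JJOw",
    "duration_seconds": 212,
    "language": "en",
    "is_auto_generated": False,
    "transcript_text": "We're no strangers to love\nYou know the rules and so do I",
    "segments": [
        {"text": "We're no strangers to love", "start": 18.5, "duration": 3.2},
        {"text": "You know the rules and so do I", "start": 22.8, "duration": 3.5},
    ],
    "available_languages": ["en", "es", "fr", "de"],
    "scraped_at": "2026-05-31T09:00:00+00:00",
}
row = parse_row(sample)
print(row.language, len(row.segments))
```

Note that dataclasses check field names, not value types, so this catches schema drift (a renamed or missing field) but not a string arriving where a number was expected.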

The naive approach (and why it falls apart) 🛠️

The standard first attempt looks like this:

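from youtube_transcript_api import YouTubeTranscriptApi

ids = ["dQw4w9WgXcQ"]  # video IDs to fetch

for video_id in ids:
    # Legacy static helper from pre-1.0 releases; newer releases use
    # YouTubeTranscriptApi().fetch(video_id) instead.
    data = YouTubeTranscriptApi.get_transcript(video_id)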

This works perfectly in a notebook. It breaks in production for three compounding reasons:

1. IP-level rate limiting. YouTube's timedtext infrastructure rate-limits requests from datacenter IP ranges significantly more aggressively than from residential IPs. A batch of a few hundred videos from a cloud VM will see throttling that a single laptop never encounters. We thread Apify residential proxies with sticky sessions on every run — fresh session_id, fresh exit IP when the target pushes back. On free plans, datacenter proxies handle small runs just fine; paid plans unlock residential routing for bulk work.

2. TLS fingerprinting on the watch-page fetch. To extract video metadata (title, channel, duration), we parse the watch page — and that page's request stack inspects the TLS ClientHello. Python's stdlib SSL and httpx both emit handshakes that don't match any real browser. We rotate browser fingerprints via curl-cffi, cycling through Chrome, Firefox, and Safari TLS + HTTP/2 profiles, so the handshake looks like a browser — because at the TLS layer it is one.

3. Parameter drift. Google quietly changes timedtext parameters every few months. A scraper that worked in January starts returning empty results in March with no error. We monitor the endpoint shape and push fixes on a 48-hour SLO, and our run logs surface partial failures loudly rather than silently emptying your dataset.
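To make the sticky-session idea from point 1 concrete, here is a rough sketch of session rotation. It assumes Apify Proxy's username convention (`groups-<GROUP>,session-<id>` against `proxy.apify.com:8000`) as I understand it from their docs, so verify it before relying on it. The `StickySession` class and its method names are hypothetical, not the Actor's internals.

```python
import uuid
from urllib.parse import quote


def apify_proxy_url(password: str, session_id: str, group: str = "RESIDENTIAL") -> str:
    """Build a proxy URL whose username pins requests to one sticky session.

    Assumption: the username format follows Apify Proxy's documented scheme.
    """
    username = f"groups-{group},session-{session_id}"
    return f"http://{username}:{quote(password, safe='')}@proxy.apify.com:8000"


def new_session_id() -> str:
    """Random alphanumeric session id."""
    return uuid.uuid4().hex[:16]


class StickySession:
    """Keep one session (and so one exit IP) until the target pushes back."""

    def __init__(self, password: str) -> None:
        self._password = password
        self.session_id = new_session_id()

    @property
    def proxy_url(self) -> str:
        return apify_proxy_url(self._password, self.session_id)

    def rotate(self) -> None:
        """Switch to a new session id, which typically maps to a new exit IP."""
        self.session_id = new_session_id()


session = StickySession(password="YOUR_PROXY_PASSWORD")
print(session.proxy_url)
session.rotate()  # call this after a 429 or a block page
print(session.proxy_url)
```

Reusing one session across a handful of requests keeps the traffic looking like one visitor; rotating only on pushback avoids burning through IPs needlessly.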

On 408 / 429 / 5xx, we retry with exponential backoff (start 2s, double, cap 30s, max 5 attempts) and honour Retry-After headers. Partial success surfaces as a clear status message; we never hand you a half-empty dataset with a green checkmark.
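The retry policy above can be sketched in a few lines of stdlib Python. This is my own illustration of the stated numbers (start 2s, double, cap 30s, max 5 attempts), not the Actor's code. The `fetch` callable returning `(status, headers, body)` is a hypothetical interface, and only the delta-seconds form of Retry-After is handled.

```python
import time
from typing import Callable


def is_retryable(status: int) -> bool:
    """408, 429, and any 5xx are worth another attempt."""
    return status in (408, 429) or 500 <= status <= 599


def fetch_with_retry(
    fetch: Callable[[], tuple[int, dict, str]],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = 5,
    start: float = 2.0,
    cap: float = 30.0,
) -> tuple[int, str]:
    """Call fetch() until it returns a non-retryable status or attempts run out.

    Returns (status, body) for the first non-retryable response and raises
    RuntimeError once max_attempts retryable responses have been seen.
    """
    delay = start
    status = 0
    for attempt in range(1, max_attempts + 1):
        status, headers, body = fetch()
        if not is_retryable(status):
            return status, body
        if attempt == max_attempts:
            break
        # Honour Retry-After when the server sends a plain number of seconds;
        # otherwise fall back to the current backoff delay.
        retry_after = str(headers.get("Retry-After", "")).strip()
        wait = float(retry_after) if retry_after.isdigit() else delay
        sleep(wait)
        delay = min(delay * 2, cap)
    raise RuntimeError(f"gave up after {max_attempts} attempts (last status {status})")


# Demo with canned responses and a recording sleep, so nothing actually waits.
canned = iter([(429, {"Retry-After": "7"}, ""), (503, {}, ""), (200, {}, "ok")])
waits: list[float] = []
print(fetch_with_retry(lambda: next(canned), sleep=waits.append), waits)
```

Injecting `sleep` keeps the function testable; in a real run you would leave the default `time.sleep`. With five attempts the waits are 2, 4, 8, and 16 seconds, so the 30-second cap only matters if you raise `max_attempts`.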

The Actor ⚙️

The Actor is at apify.com/DevilScrapes/youtube-transcript-scraper. Paste video URLs into the Apify Console and click Start, or drive it programmatically:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/youtube-transcript-scraper").call(
    run_input={
        "videoUrls": [
            "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
            "dQw4w9WgXcQ",          # bare IDs work too
            "https://youtu.be/dQw4w9WgXcQ"   # youtu.be short links work too
        ],
        "language": "en",
        "includeSegments": True,
        "concurrency": 4,
        "proxyConfiguration": {"useApifyProxy": True}
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["transcript_text"][:200])

Key input parameters:

Field	Default	Notes
`videoUrls`	—	Full URLs, bare IDs, or youtu.be short links. Shorts accepted.
`language`	`en`	ISO-639-1 code. Track preference: manual language → auto language → manual any → auto any.
`includeSegments`	`true`	Adds per-cue `{ text, start, duration }` array alongside `transcript_text`.
`concurrency`	`4`	Parallel fetches. Max 16.
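If you want to deduplicate inputs before paying for a run, you can normalise the accepted `videoUrls` forms to bare IDs on your side. A hedged sketch: `extract_video_id` is a hypothetical helper, the 11-character ID pattern is an observed convention rather than a documented guarantee, and the Actor's own parsing may accept more URL shapes than this.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 characters from the URL-safe base64 alphabet.
_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def extract_video_id(value: str) -> Optional[str]:
    """Return the video ID from a full URL, youtu.be link, Shorts URL, or bare ID."""
    value = value.strip()
    if _ID_RE.fullmatch(value):
        return value  # already a bare ID

    parsed = urlparse(value if "://" in value else f"https://{value}")
    host = (parsed.hostname or "").lower().removeprefix("www.").removeprefix("m.")
    parts = [p for p in parsed.path.split("/") if p]

    candidate = None
    if host == "youtu.be" and parts:
        candidate = parts[0]
    elif host == "youtube.com":
        if parts[:1] == ["watch"]:
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        elif len(parts) >= 2 and parts[0] in ("shorts", "embed", "live"):
            candidate = parts[1]

    return candidate if candidate and _ID_RE.fullmatch(candidate) else None


inputs = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "dQw4w9WgXcQ",
    "https://youtu.be/dQw4w9WgXcQ",
]
unique_ids = sorted({vid for vid in map(extract_video_id, inputs) if vid})
print(unique_ids)
```

Since billing is per transcript delivered, collapsing three spellings of the same video into one ID is a cheap guard against paying for duplicates.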

Export the finished dataset as JSON, CSV, or Excel from the Apify Console, or stream it via the Apify Dataset API.

Use cases 💡

RAG corpus for video knowledge bases. Fetch transcripts for a playlist — a conference archive, a course series, a curated topic library — then chunk and embed. The segments array gives you natural sentence-length boundaries that produce better chunks than arbitrary character splits.
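As a rough illustration of cue-aligned chunking, the sketch below packs consecutive segments into chunks of at most `max_chars` characters and keeps the start and end timestamps, so each embedded chunk can link back to the right moment in the video. The `Chunk` type and `chunk_segments` function are hypothetical names, and the 500-character default is an arbitrary starting point, not a recommendation from the Actor.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start: float  # seconds, start of the first cue in the chunk
    end: float    # seconds, end of the last cue in the chunk


def chunk_segments(segments: list[dict], max_chars: int = 500) -> list[Chunk]:
    """Greedily pack whole cues into chunks without splitting any cue.

    A single cue longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[Chunk] = []
    texts: list[str] = []
    length = 0
    start = end = 0.0

    for cue in segments:
        text = cue["text"].strip()
        if not text:
            continue
        added = len(text) + (1 if texts else 0)  # +1 for the joining space
        if texts and length + added > max_chars:
            chunks.append(Chunk(" ".join(texts), start, end))
            texts, length = [], 0
            added = len(text)
        if not texts:
            start = cue["start"]
        texts.append(text)
        length += added
        end = cue["start"] + cue["duration"]

    if texts:
        chunks.append(Chunk(" ".join(texts), start, end))
    return chunks


cues = [
    {"text": "We're no strangers to love", "start": 18.5, "duration": 3.2},
    {"text": "You know the rules and so do I", "start": 22.8, "duration": 3.5},
]
for chunk in chunk_segments(cues, max_chars=500):
    print(f"[{chunk.start:.1f}s to {chunk.end:.1f}s] {chunk.text}")
```

Storing `start` alongside each embedding lets a retrieval hit deep-link to `watch?v=<id>&t=<start>s`, which is usually the thing users actually want from a video knowledge base.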

Automated show notes for podcasters. Many podcast producers upload episodes to YouTube. A weekly automation can fetch transcripts for new uploads, pass them to an LLM, and publish show notes without a human transcription step. At $0.004 per episode, a 50-episode back-catalogue costs $0.20 in transcript data.

Multilingual content analysis. The Actor returns available_languages for every video. Request Spanish transcripts on an English-language channel and build a parallel corpus for translation benchmarking or cross-lingual NLP.

Discourse and policy research. Fetch a defined corpus of regulatory hearings, earnings calls, or conference talks. The is_auto_generated flag lets you weight manual transcripts more heavily where accuracy matters.

Search index over a creator's back-catalogue. Build an internal "search everything this creator has ever said" tool for a team or newsroom across a channel's full transcript archive.

Pricing — exact numbers 💰

Pay-per-event. You pay per transcript delivered, nothing for videos that return no captions.

Event	USD
Actor start (one-off per run)	$0.005
Result (per transcript)	$0.004

Batch size	Cost
100 transcripts	$0.41
1,000 transcripts	$4.01
10,000 transcripts	$40.01
100,000 transcripts (monthly corpus)	$400.01

Apify's $5 free trial credit covers roughly your first 1,248 transcripts in a single run (one $0.005 start fee plus $0.004 per transcript) with no credit card required, and there's no subscription or minimum.
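The arithmetic behind the tables is one start fee plus a per-transcript fee. Here is a small sketch using `Decimal` to avoid float rounding surprises. The half-up rounding to the cent is my assumption, chosen because it matches the batch table; the function names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

ACTOR_START = Decimal("0.005")     # charged once per run
PER_TRANSCRIPT = Decimal("0.004")  # charged per transcript delivered


def run_cost(transcripts: int) -> Decimal:
    """Exact USD cost of one run that delivers `transcripts` results."""
    return ACTOR_START + PER_TRANSCRIPT * transcripts


def to_cents(amount: Decimal) -> Decimal:
    """Round to the cent, half-up (assumed, to match the table above)."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def transcripts_for_budget(budget: Decimal) -> int:
    """How many transcripts a single run can deliver within `budget`."""
    if budget < ACTOR_START:
        return 0
    return int((budget - ACTOR_START) // PER_TRANSCRIPT)


for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7,} transcripts: ${to_cents(run_cost(n))}")
print("transcripts within $5:", transcripts_for_budget(Decimal("5")))
```

Splitting the same batch across several runs adds one start fee per run, so a single large run is marginally cheaper than many small ones.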

The technically interesting part 🔬

The detail most transcript tools don't mention: YouTube's caption-track selection logic. When you request language en, there are up to four candidate tracks in priority order:

1. Manual transcript in English: the "gold" track when available
2. Auto-generated ASR in English: present on most modern videos
3. Manual transcript in another language: fallback for non-English channels
4. Auto-generated ASR in another language

The ordering matters for NLP pipelines because auto-generated transcripts carry filler words, broken sentence boundaries, and ASR errors that manual tracks don't. The Actor implements this exact priority chain and flags the result with is_auto_generated so you can filter or weight downstream.
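The priority chain reduces to a two-part sort key: language match first, manual before auto second. Below is a minimal sketch of that idea. The `Track` type and `pick_track` function are hypothetical, not the Actor's source, and language matching here is an exact string comparison, so `en-US` would not match `en`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Track:
    language: str
    is_auto_generated: bool


def pick_track(tracks: list[Track], language: str) -> Optional[Track]:
    """Pick the best track: manual in language, auto in language,
    manual in any language, then auto in any language."""

    def rank(track: Track) -> tuple[bool, bool]:
        # False sorts before True, so a language match wins first
        # and a manual track wins within each language group.
        return (track.language != language, track.is_auto_generated)

    # min() returns the first of equally ranked tracks, so the
    # original listing order breaks ties.
    return min(tracks, key=rank, default=None)


available = [Track("es", False), Track("en", True), Track("de", True)]
print(pick_track(available, "en"))  # auto English beats manual Spanish
print(pick_track(available, "fr"))  # no French, so a manual track in another language
```

Whatever comes back, the `is_auto_generated` flag on the chosen track is what you would carry into the output row for downstream filtering.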

Limitations 🚧

Some videos have captions disabled. The uploader controls this. Affected videos are skipped — no charge for that item.
Live streams have no transcript until after the broadcast ends. Live captions are a separate system.
Age-restricted or private videos are not accessible. The Actor reads only public metadata and does not authenticate as a user.
YouTube rate-limits aggressive bulk runs. Residential proxy routing handles this for large batches; datacenter is fine for small runs.
Auto-generated transcripts carry ASR noise. Homophones, proper nouns, and domain-specific terms come back wrong. The is_auto_generated flag lets you weight downstream accordingly.

FAQ

Is scraping YouTube transcripts legal?
Transcripts are public metadata — the same text any viewer reads in the subtitle overlay. The Actor uses the same endpoint YouTube's own website calls, accesses only public videos, and collects no personal data. As always, consult your own legal counsel for your jurisdiction and use case. The Actor paces its requests responsibly and does not interfere with normal platform operation.

Is this the same as youtube-transcript-api?
youtube-transcript-api is an excellent open-source library; use it for one-off scripts. Use this Actor when you need batched, scheduled, proxy-rotated, automatically retried runs at scale, without maintaining your own infrastructure. The two are complementary.

Can I export to a warehouse or feed a vector store?
Yes — export JSON/CSV/Excel from the Apify Console, or trigger a webhook on ACTOR.RUN.SUCCEEDED to pipe the dataset into S3, BigQuery, Pinecone, or Weaviate via Make, n8n, or Zapier. The Apify API also lets you pull items directly.

What if a video has no English transcript but has Spanish?
Set language to es. Or leave it at en: the track-preference fallback still returns the best available track and records the actual language in the language field. The available_languages array shows you everything that was on offer.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/youtube-transcript-scraper

Free $5 trial credit, no credit card. Drop a playlist of 100 videos and you'll have a transcript dataset in a few minutes for under a dollar. Building something on top — a RAG, a show-notes pipeline, a discourse corpus? Drop the use case in the comments; the roadmap follows what people ship.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈

DEV Community