YouTube has the largest collection of spoken-word content on the internet. Lectures, interviews, tutorials, podcasts — millions of hours of text locked inside video. Here's how to extract it at scale and feed it into RAG pipelines.
Why YouTube Transcripts for AI?
- Volume: YouTube has 800M+ videos. Most have auto-generated captions.
- Diversity: Covers every topic, language, and speaking style.
- Quality: Manual captions (available on ~30% of videos) are human-verified.
- Free: Unlike news articles behind paywalls, YouTube captions are publicly accessible.
Common use cases:
- Fine-tuning LLMs on domain-specific content (medical lectures, legal talks)
- Building RAG systems over video libraries
- Creating searchable indexes of conference talks
- Generating training data for speech-to-text models
The Technical Challenge
YouTube doesn't have a public "get transcript" API. The official Data API v3 provides metadata but not captions. You need to:
- Call YouTube's internal InnerTube API (used by the mobile/TV apps)
- Get caption track URLs from the player response
- Fetch and parse the caption XML
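The caption payload in the last step is a simple timed-text XML format. The sketch below parses it with the standard library; note that this schema is undocumented and observed from real responses, so it may change without notice:

```python
import html
import xml.etree.ElementTree as ET

# Example of YouTube's timed-text caption format (undocumented; may change).
SAMPLE_XML = """<transcript>
  <text start="0.0" dur="2.5">Welcome to the lecture.</text>
  <text start="2.5" dur="3.1">Today we&amp;#39;ll cover backpropagation.</text>
</transcript>"""

def parse_caption_xml(xml_text: str) -> list[dict]:
    """Turn caption XML into a list of {start, dur, text} cues."""
    cues = []
    for node in ET.fromstring(xml_text).iter("text"):
        cues.append({
            "start": float(node.get("start", 0)),
            "dur": float(node.get("dur", 0)),
            # Caption text is HTML-escaped (sometimes doubly), so unescape twice.
            "text": html.unescape(html.unescape(node.text or "")),
        })
    return cues

cues = parse_caption_xml(SAMPLE_XML)
print(" ".join(c["text"] for c in cues))
```

Keeping `start`/`dur` around pays off later: you can map any RAG chunk back to a timestamped deep link into the video.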
Why InnerTube?
The web player uses a JavaScript-based approach (`ytInitialPlayerResponse`) that often strips captions for embedded content. The InnerTube API, used by mobile apps, reliably returns caption tracks — even for music videos.
```javascript
// Try multiple client types for maximum compatibility
const clients = [
  { name: 'ANDROID', clientVersion: '20.10.38' },
  { name: 'IOS', clientVersion: '20.10.4' },
  { name: 'TVHTML5_SIMPLY_EMBEDDED_PLAYER', clientVersion: '2.0' },
];
```
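A minimal Python version of the same fallback might look like this. It assumes the `requests` library and the unofficial `youtubei/v1/player` endpoint; the nested field names (`captions`, `playerCaptionsTracklistRenderer`, `captionTracks`) reflect responses observed in practice and are not a stable contract:

```python
import requests

CLIENTS = [
    {"clientName": "ANDROID", "clientVersion": "20.10.38"},
    {"clientName": "IOS", "clientVersion": "20.10.4"},
    {"clientName": "TVHTML5_SIMPLY_EMBEDDED_PLAYER", "clientVersion": "2.0"},
]

def caption_tracks(player_response: dict) -> list[dict]:
    """Safely dig the caption track list out of a player response."""
    return (
        player_response.get("captions", {})
        .get("playerCaptionsTracklistRenderer", {})
        .get("captionTracks", [])
    )

def fetch_player_response(video_id: str, client: dict) -> dict:
    """POST to the internal player endpoint with a given client identity."""
    resp = requests.post(
        "https://www.youtube.com/youtubei/v1/player",
        json={"videoId": video_id, "context": {"client": client}},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def get_tracks_with_fallback(video_id: str) -> list[dict]:
    """Try each client in order until one returns caption tracks."""
    for client in CLIENTS:
        tracks = caption_tracks(fetch_player_response(video_id, client))
        if tracks:
            return tracks
    return []
```

Each track carries a `baseUrl` you can fetch to get the caption XML from the previous section.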
Building the Pipeline
Step 1: Extract Transcripts
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")

run = client.actor("george.the.developer/youtube-transcript-scraper").call(run_input={
    "urls": ["https://www.youtube.com/playlist?list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF"],
    "language": "en",
    "outputFormat": "full-text",
    "includeMetadata": True,
    "maxVideos": 200,
})

documents = []
for video in client.dataset(run["defaultDatasetId"]).iterate_items():
    if video.get("hasTranscript"):
        documents.append({
            "text": video["transcriptText"],
            "metadata": {
                "source": video["videoUrl"],
                "title": video["title"],
                "channel": video["channelName"],
                "date": video.get("publishDate", ""),
                "word_count": video.get("wordCount", 0),
            },
        })

print(f"Extracted {len(documents)} transcripts")
```
Step 2: Chunk for RAG
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
)

chunks = []
for doc in documents:
    doc_chunks = splitter.split_text(doc["text"])
    for i, chunk in enumerate(doc_chunks):
        chunks.append({
            "text": chunk,
            "metadata": {**doc["metadata"], "chunk_index": i},
        })

print(f"Created {len(chunks)} chunks from {len(documents)} videos")
```
Step 3: Embed and Index
```python
from openai import OpenAI
import chromadb

openai_client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("youtube-lectures")

batch_size = 100
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    texts = [c["text"] for c in batch]
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    collection.add(
        documents=texts,
        embeddings=[e.embedding for e in response.data],
        metadatas=[c["metadata"] for c in batch],
        ids=[f"chunk-{i + j}" for j in range(len(batch))],
    )
```
Step 4: Query
```python
results = collection.query(
    query_texts=["What is backpropagation?"],
    n_results=5,
)

for doc, metadata in zip(results["documents"][0], results["metadatas"][0]):
    print(f"{metadata['title']}")
    print(f"  {metadata['source']}")
    print(f"  ...{doc[:200]}...")
```
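To close the loop into a full RAG answer, the retrieved chunks can be assembled into a grounded prompt for a chat model. The prompt format below is one reasonable choice, not part of the original pipeline:

```python
def build_prompt(question: str, docs: list[str], metadatas: list[dict]) -> str:
    """Assemble retrieved chunks into a grounded prompt with source links."""
    context = "\n\n".join(
        f"[{m['title']}]({m['source']})\n{d}" for d, m in zip(docs, metadatas)
    )
    return (
        "Answer the question using only the transcript excerpts below. "
        "Cite the video titles you used.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "What is backpropagation?",
    ["Backpropagation computes gradients layer by layer..."],
    [{"title": "Lecture 4", "source": "https://youtube.com/watch?v=abc"}],
)
# Pass `prompt` to your chat model of choice, e.g.:
# openai_client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": prompt}])
```

Including the source URL in the context lets the model cite the exact video it drew from.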
Performance & Cost
For 200 videos from a Stanford ML playlist:
| Metric | Value |
|---|---|
| Transcripts extracted | 187/200 (93.5%) |
| Total words | 1.2M |
| Chunks created | 4,800 |
| Extraction time | ~3 minutes |
| Extraction cost | ~$0.94 |
| Embedding cost (OpenAI) | ~$0.05 |
| Total cost | ~$1.00 |
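The embedding line in the table is easy to sanity-check. Assuming roughly 1.3 tokens per English word and text-embedding-3-small's listed price of $0.02 per million tokens (verify current pricing), the estimate lands in the same ballpark as the ~$0.05 above:

```python
words = 1_200_000                # total words from the table
tokens = words * 1.3             # rough tokens-per-word ratio for English
price_per_million = 0.02         # text-embedding-3-small, USD per 1M tokens
cost = tokens / 1_000_000 * price_per_million
print(f"~{tokens / 1e6:.1f}M tokens -> ${cost:.3f}")
```

That works out to about three cents; the table's ~$0.05 is a conservative round-up.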
Handling Edge Cases
- No captions: ~7% of videos have no captions at all.
- Auto-generated quality: Auto-captions have ~5-10% word error rate. Filter to manual captions for critical applications.
- Multiple languages: Request specific languages with the `language` parameter.
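One practical response to the auto-caption error rate is to keep only manually captioned videos. The field name `isAutoGenerated` below is illustrative, not guaranteed — check the actual schema your extractor returns:

```python
def manual_only(videos: list[dict]) -> list[dict]:
    """Keep videos whose transcript came from a human-written caption track.

    `isAutoGenerated` is an illustrative field name; check the actual
    schema returned by your extractor. Missing flags are treated as
    auto-generated, the conservative default.
    """
    return [
        v for v in videos
        if v.get("hasTranscript") and not v.get("isAutoGenerated", True)
    ]

sample = [
    {"title": "Lecture 1", "hasTranscript": True, "isAutoGenerated": False},
    {"title": "Lecture 2", "hasTranscript": True, "isAutoGenerated": True},
    {"title": "Lecture 3", "hasTranscript": False},
]
print([v["title"] for v in manual_only(sample)])
```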
Source Code
Full extractor: github.com/the-ai-entrepreneur-ai-hub/youtube-transcript-extractor
Run it on Apify: apify.com/george.the.developer/youtube-transcript-scraper
Also available as an API on RapidAPI.
What are you building with YouTube transcript data? I'd love to hear about your RAG setup.