The AI Entrepreneur
Building an AI Training Data Pipeline with YouTube Transcripts

YouTube has the largest collection of spoken-word content on the internet. Lectures, interviews, tutorials, podcasts — millions of hours of text locked inside video. Here's how to extract it at scale and feed it into RAG pipelines.

Why YouTube Transcripts for AI?

  • Volume: YouTube has 800M+ videos. Most have auto-generated captions.
  • Diversity: Covers every topic, language, and speaking style.
  • Quality: Manual captions (available on ~30% of videos) are human-verified.
  • Free: Unlike news articles behind paywalls, YouTube captions are publicly accessible.

Common use cases:

  • Fine-tuning LLMs on domain-specific content (medical lectures, legal talks)
  • Building RAG systems over video libraries
  • Creating searchable indexes of conference talks
  • Generating training data for speech-to-text models

The Technical Challenge

YouTube doesn't have a public "get transcript" API. The official Data API v3 exposes caption metadata, but downloading the actual tracks requires OAuth authorization from the video's owner. To get transcripts for arbitrary videos, you need to:

  1. Call YouTube's internal InnerTube API (used by the mobile/TV apps)
  2. Get caption track URLs from the player response
  3. Fetch and parse the caption XML
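The three steps above can be sketched in Python. The `youtubei/v1/player` endpoint, the `captionTracks` path in the player response, and the timedtext XML shape are community-documented internals, not a stable public API — treat them as assumptions that may change without notice:

```python
import html
import xml.etree.ElementTree as ET


def build_player_request(video_id: str, client_name: str = "ANDROID",
                         client_version: str = "20.10.38") -> dict:
    """Step 1: build the JSON body for an InnerTube player request."""
    return {
        "context": {
            "client": {
                "clientName": client_name,
                "clientVersion": client_version,
            }
        },
        "videoId": video_id,
    }


def extract_caption_urls(player_response: dict) -> list:
    """Step 2: pull caption track URLs out of a player response."""
    tracks = (
        player_response.get("captions", {})
        .get("playerCaptionsTracklistRenderer", {})
        .get("captionTracks", [])
    )
    return [{"lang": t.get("languageCode"), "url": t.get("baseUrl")} for t in tracks]


def parse_caption_xml(xml_text: str) -> str:
    """Step 3: join a timedtext XML document into plain transcript text."""
    root = ET.fromstring(xml_text)
    return " ".join(
        html.unescape(node.text) for node in root.iter("text") if node.text
    )
```

POST the request body to `https://www.youtube.com/youtubei/v1/player`, feed the JSON response to `extract_caption_urls`, then fetch a track's `baseUrl` and run it through `parse_caption_xml`.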

Why InnerTube?

The web player uses a JavaScript-based approach (ytInitialPlayerResponse) that often strips captions for embedded content. The InnerTube API, used by mobile apps, reliably returns caption tracks — even for music videos.

// Try multiple client types for maximum compatibility
const clients = [
    { name: 'ANDROID', clientVersion: '20.10.38' },
    { name: 'IOS', clientVersion: '20.10.4' },
    { name: 'TVHTML5_SIMPLY_EMBEDDED_PLAYER', clientVersion: '2.0' },
];

Building the Pipeline

Step 1: Extract Transcripts

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")

run = client.actor("george.the.developer/youtube-transcript-scraper").call(run_input={
    "urls": ["https://www.youtube.com/playlist?list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF"],
    "language": "en",
    "outputFormat": "full-text",
    "includeMetadata": True,
    "maxVideos": 200,
})

documents = []
for video in client.dataset(run["defaultDatasetId"]).iterate_items():
    if video.get("hasTranscript"):
        documents.append({
            "text": video["transcriptText"],
            "metadata": {
                "source": video["videoUrl"],
                "title": video["title"],
                "channel": video["channelName"],
                "date": video.get("publishDate", ""),
                "word_count": video.get("wordCount", 0),
            }
        })

print(f"Extracted {len(documents)} transcripts")

Step 2: Chunk for RAG

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
)

chunks = []
for doc in documents:
    doc_chunks = splitter.split_text(doc["text"])
    for i, chunk in enumerate(doc_chunks):
        chunks.append({
            "text": chunk,
            "metadata": {**doc["metadata"], "chunk_index": i}
        })

print(f"Created {len(chunks)} chunks from {len(documents)} videos")

Step 3: Embed and Index

from openai import OpenAI
import chromadb

openai_client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("youtube-lectures")

batch_size = 100
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i+batch_size]
    texts = [c["text"] for c in batch]

    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )

    collection.add(
        documents=texts,
        embeddings=[e.embedding for e in response.data],
        metadatas=[c["metadata"] for c in batch],
        ids=[f"chunk-{i+j}" for j in range(len(batch))],
    )

Step 4: Query

results = collection.query(
    query_texts=["What is backpropagation?"],
    n_results=5,
)

for doc, metadata in zip(results["documents"][0], results["metadatas"][0]):
    print(f"{metadata['title']}")
    print(f"   {metadata['source']}")
    print(f"   ...{doc[:200]}...")

Performance & Cost

For 200 videos from a Stanford ML playlist:

| Metric | Value |
| --- | --- |
| Transcripts extracted | 187/200 (93.5%) |
| Total words | 1.2M |
| Chunks created | 4,800 |
| Extraction time | ~3 minutes |
| Extraction cost | ~$0.94 |
| Embedding cost (OpenAI) | ~$0.05 |
| Total cost | ~$1.00 |
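The embedding figure is easy to sanity-check. Assuming roughly 1.33 tokens per English word and text-embedding-3-small's published rate of $0.02 per million tokens (verify current pricing before relying on this), 1.2M words lands in the same ballpark as the table:

```python
def estimate_embedding_cost(words: float, tokens_per_word: float = 1.33,
                            usd_per_million_tokens: float = 0.02) -> float:
    """Back-of-envelope embedding cost; both rates are assumptions."""
    tokens = words * tokens_per_word
    return tokens / 1e6 * usd_per_million_tokens


cost = estimate_embedding_cost(1_200_000)  # roughly $0.03
```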

Handling Edge Cases

  • No captions: ~7% of videos have no captions at all.
  • Auto-generated quality: Auto-captions have ~5-10% word error rate. Filter to manual captions for critical applications.
  • Multiple languages: Request specific languages with the language parameter.
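A small filter pass handles the first two cases before chunking. Note that `transcriptSource` is a hypothetical field name used here for illustration — check the actual actor output for how manual vs. auto-generated captions are flagged:

```python
def filter_documents(documents, manual_only=False, min_words=100):
    """Drop low-quality transcripts before chunking.

    `transcriptSource` is a hypothetical metadata field; inspect the real
    dataset items to find the equivalent flag in your extractor's output.
    """
    kept = []
    for doc in documents:
        meta = doc["metadata"]
        if meta.get("word_count", 0) < min_words:
            continue  # too short to be a useful transcript
        if manual_only and meta.get("transcriptSource") != "manual":
            continue  # auto-captions excluded for critical applications
        kept.append(doc)
    return kept
```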

Source Code

Full extractor: github.com/the-ai-entrepreneur-ai-hub/youtube-transcript-extractor

Run it on Apify: apify.com/george.the.developer/youtube-transcript-scraper

Also available as an API on RapidAPI.


What are you building with YouTube transcript data? I'd love to hear about your RAG setup.
