How to Get YouTube Transcripts: A Complete Developer's Guide
YouTube transcripts unlock a world of possibilities—from building AI-powered video summarizers to creating searchable content databases, training machine learning models, or automating content repurposing. But getting transcripts programmatically isn't always straightforward.
In this guide, we'll walk through every method for extracting YouTube transcripts, complete with working code examples. By the end, you'll know exactly how to implement transcript extraction in your own projects.
Understanding YouTube Transcripts
Before diving into code, let's understand what we're working with.
YouTube stores transcripts (also called captions or subtitles) in a few different ways:
- Auto-generated captions: YouTube's speech recognition creates these automatically for most videos
- Manual captions: Uploaded by creators for better accuracy
- Community contributions: Viewer-submitted captions (discontinued by YouTube in 2020)
Each caption track includes the text and timing information, typically in formats like SRT, VTT, or YouTube's proprietary timedtext format.
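To make the formats concrete, here is a single caption cue as SRT, which uses a numeric index and comma decimals (illustrative, not from a real video):

```text
1
00:00:00,000 --> 00:00:02,500
Hello everyone
```

and the same cue as WebVTT, which adds a file header and uses dot decimals:

```text
WEBVTT

00:00:00.000 --> 00:00:02.500
Hello everyone
```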
Method 1: Using the youtube-transcript-api (Python)
The most popular open-source solution for Python developers is the youtube-transcript-api library. Note that recent 1.x releases of the library moved to an instance-based API; the examples below use the long-standing static methods (get_transcript, list_transcripts), so check your installed version if the calls look different.
Installation
```bash
pip install youtube-transcript-api
```
Basic Usage
```python
from youtube_transcript_api import YouTubeTranscriptApi

# Extract transcript from a video
video_id = "dQw4w9WgXcQ"  # The video ID from the URL

try:
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    for entry in transcript:
        print(f"[{entry['start']:.2f}s] {entry['text']}")
except Exception as e:
    print(f"Error: {e}")
```
Output Format
The library returns a list of dictionaries:
```python
[
    {'text': 'Hello everyone', 'start': 0.0, 'duration': 2.5},
    {'text': 'Welcome to my video', 'start': 2.5, 'duration': 3.0},
    # ...
]
```
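Because the structure is just a list of dicts, common derived values are one-liners. A small sketch using the sample entries above:

```python
transcript = [
    {'text': 'Hello everyone', 'start': 0.0, 'duration': 2.5},
    {'text': 'Welcome to my video', 'start': 2.5, 'duration': 3.0},
]

# Flatten to plain text
full_text = ' '.join(entry['text'] for entry in transcript)

# End time of the last cue = its start plus its duration
end_time = max(e['start'] + e['duration'] for e in transcript)

print(full_text)  # Hello everyone Welcome to my video
print(end_time)   # 5.5
```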
Getting Transcripts in Different Languages
```python
from youtube_transcript_api import YouTubeTranscriptApi

video_id = "dQw4w9WgXcQ"

# List available transcripts
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
for transcript in transcript_list:
    print(f"Language: {transcript.language} ({transcript.language_code})")
    print(f"Auto-generated: {transcript.is_generated}")

# Get specific language (first match in the list wins)
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['es', 'en'])

# Translate to another language
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
transcript = transcript_list.find_transcript(['en'])
translated = transcript.translate('es').fetch()
```
Handling Videos Without Transcripts
```python
# The exception classes are re-exported at the package top level,
# so there's no need to import from the private _errors module
from youtube_transcript_api import (
    YouTubeTranscriptApi,
    TranscriptsDisabled,
    NoTranscriptFound,
)

video_id = "your_video_id"

try:
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
except TranscriptsDisabled:
    print("Transcripts are disabled for this video")
except NoTranscriptFound:
    print("No transcript available for this video")
except Exception as e:
    print(f"An error occurred: {e}")
```
Combining Into Plain Text
```python
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter, SRTFormatter

video_id = "dQw4w9WgXcQ"
transcript = YouTubeTranscriptApi.get_transcript(video_id)

# Plain text
formatter = TextFormatter()
plain_text = formatter.format_transcript(transcript)
print(plain_text)

# SRT format
srt_formatter = SRTFormatter()
srt_output = srt_formatter.format_transcript(transcript)
print(srt_output)
```
Method 2: Using yt-dlp (Python/CLI)
yt-dlp is a powerful media downloader that can also extract subtitles.
Installation
```bash
pip install yt-dlp
```
Command Line Usage
```bash
# Download auto-generated subtitles
yt-dlp --write-auto-sub --skip-download "https://youtube.com/watch?v=VIDEO_ID"

# Download manual subtitles
yt-dlp --write-sub --skip-download "https://youtube.com/watch?v=VIDEO_ID"

# Specify format (srt, vtt, etc.)
yt-dlp --write-auto-sub --sub-format srt --skip-download "https://youtube.com/watch?v=VIDEO_ID"

# List available subtitles
yt-dlp --list-subs "https://youtube.com/watch?v=VIDEO_ID"
```
Python Usage
```python
import yt_dlp

def get_transcript_with_ytdlp(video_url):
    ydl_opts = {
        'writeautomaticsub': True,
        'writesubtitles': True,
        'subtitlesformat': 'json3',
        'skip_download': True,
        'outtmpl': '%(id)s',
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(video_url, download=False)

    # Access subtitle information
    if 'subtitles' in info:
        print("Manual subtitles available:", list(info['subtitles'].keys()))
    if 'automatic_captions' in info:
        print("Auto captions available:", list(info['automatic_captions'].keys()))
    return info

video_url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
info = get_transcript_with_ytdlp(video_url)
```
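If you let yt-dlp actually download the json3 subtitle files (pass download=True with the options above), each file contains a list of caption events. Here is a sketch of flattening that payload into the same {text, start, duration} shape used earlier; the field names (events, tStartMs, dDurationMs, segs, utf8) reflect YouTube's json3 format as I understand it, so verify against a real download:

```python
def parse_json3(data):
    """Flatten a YouTube json3 caption payload into {text, start, duration} dicts."""
    entries = []
    for event in data.get('events', []):
        segs = event.get('segs') or []
        text = ''.join(seg.get('utf8', '') for seg in segs).strip()
        if not text:
            continue  # skip events that carry no text (e.g. window styling)
        entries.append({
            'text': text,
            'start': event.get('tStartMs', 0) / 1000,
            'duration': event.get('dDurationMs', 0) / 1000,
        })
    return entries

# Fabricated sample payload
sample = {'events': [
    {'tStartMs': 0, 'dDurationMs': 2500,
     'segs': [{'utf8': 'Hello '}, {'utf8': 'everyone'}]},
    {'tStartMs': 2500, 'segs': []},  # no text, skipped
]}
print(parse_json3(sample))
# [{'text': 'Hello everyone', 'start': 0.0, 'duration': 2.5}]
```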
Method 3: Direct YouTube API Approach
You can also fetch captions directly using YouTube's timedtext endpoint, though this method is less reliable and may break with YouTube updates.
```python
import json
import re
from xml.etree import ElementTree

import requests

def get_transcript_direct(video_id):
    # First, get the video page to find caption tracks
    watch_url = f"https://www.youtube.com/watch?v={video_id}"
    response = requests.get(watch_url)

    # Extract the caption track URL from the page.
    # This is fragile and may break with YouTube updates.
    pattern = r'"captions":.*?"captionTracks":\[(.*?)\]'
    match = re.search(pattern, response.text)
    if not match:
        return None

    caption_data = json.loads('[' + match.group(1) + ']')
    if not caption_data:
        return None

    # Get the first available caption track
    caption_url = caption_data[0]['baseUrl']

    # Fetch the captions
    caption_response = requests.get(caption_url)

    # Parse the XML
    root = ElementTree.fromstring(caption_response.text)
    transcript = []
    for text_element in root.findall('.//text'):
        transcript.append({
            'text': text_element.text or '',
            'start': float(text_element.get('start', 0)),
            'duration': float(text_element.get('dur', 0))
        })
    return transcript

# Usage
transcript = get_transcript_direct("dQw4w9WgXcQ")
if transcript:
    for entry in transcript:
        print(f"[{entry['start']:.2f}s] {entry['text']}")
```
Warning: This method parses YouTube's page structure directly and can break whenever YouTube updates their frontend. Use with caution in production.
Method 4: Building Your Own Audio-Based Transcription
When captions aren't available, you'll need to transcribe the audio yourself. Here's how to build a basic pipeline using OpenAI's Whisper.
Installation
```bash
pip install yt-dlp openai-whisper torch
```
Complete Pipeline
```python
import os

import whisper
import yt_dlp

def download_audio(video_url, output_path="audio.mp3"):
    """Download audio from a YouTube video."""
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
        # The postprocessor appends .mp3, so strip the extension here
        'outtmpl': output_path.replace('.mp3', ''),
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([video_url])
    return output_path

def transcribe_audio(audio_path, model_size="base"):
    """Transcribe audio using Whisper."""
    model = whisper.load_model(model_size)
    return model.transcribe(audio_path)

def get_transcript_with_whisper(video_url):
    """Complete pipeline: download and transcribe."""
    print("Downloading audio...")
    audio_path = download_audio(video_url)

    print("Transcribing...")
    result = transcribe_audio(audio_path)

    # Clean up the temporary audio file
    os.remove(audio_path)

    # Format output
    transcript = []
    for segment in result['segments']:
        transcript.append({
            'text': segment['text'].strip(),
            'start': segment['start'],
            'end': segment['end']
        })
    return transcript, result['text']

# Usage
video_url = "https://www.youtube.com/watch?v=VIDEO_ID"
segments, full_text = get_transcript_with_whisper(video_url)

print("Full transcript:")
print(full_text)

print("\nWith timestamps:")
for segment in segments:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
```
Getting Word-Level Timestamps
```python
import whisper

def transcribe_with_word_timestamps(audio_path):
    model = whisper.load_model("base")
    result = model.transcribe(audio_path, word_timestamps=True)
    words = []
    for segment in result['segments']:
        for word in segment.get('words', []):
            words.append({
                'word': word['word'].strip(),
                'start': word['start'],
                'end': word['end']
            })
    return words

# Usage
words = transcribe_with_word_timestamps("audio.mp3")
for word in words[:20]:  # First 20 words
    print(f"[{word['start']:.2f}s] {word['word']}")
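If you want to emit SRT from Whisper segments yourself, the only fiddly part is the timestamp format. Here is a small helper for that conversion (the function name and rounding behaviour are my own choices, not part of Whisper):

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

print(srt_timestamp(0.0))      # 00:00:00,000
print(srt_timestamp(75.5))     # 00:01:15,500
print(srt_timestamp(3661.25))  # 01:01:01,250
```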
Method 5: Node.js Implementation
For JavaScript/Node.js developers, here's how to extract transcripts.
Using youtube-transcript
```bash
npm install youtube-transcript
```
```javascript
import { YoutubeTranscript } from 'youtube-transcript';

async function getTranscript(videoId) {
  try {
    const transcript = await YoutubeTranscript.fetchTranscript(videoId);
    transcript.forEach(entry => {
      console.log(`[${entry.offset.toFixed(2)}s] ${entry.text}`);
    });
    return transcript;
  } catch (error) {
    console.error('Error fetching transcript:', error.message);
    return null;
  }
}

// Usage
getTranscript('dQw4w9WgXcQ');
```
Building a Simple Express API
```javascript
import express from 'express';
import { YoutubeTranscript } from 'youtube-transcript';

const app = express();

app.get('/transcript/:videoId', async (req, res) => {
  try {
    const { videoId } = req.params;
    const { format } = req.query;
    const transcript = await YoutubeTranscript.fetchTranscript(videoId);

    if (format === 'text') {
      const text = transcript.map(entry => entry.text).join(' ');
      res.type('text/plain').send(text);
    } else {
      const last = transcript[transcript.length - 1];
      res.json({
        videoId,
        transcript,
        // End time of the last cue, not just its start offset
        totalDuration: last ? last.offset + last.duration : 0
      });
    }
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(3000, () => {
  console.log('Transcript API running on port 3000');
});
```
Common Challenges and Solutions
Challenge 1: Rate Limiting
YouTube may rate-limit your requests if you make too many in a short period.
```python
import time
from youtube_transcript_api import YouTubeTranscriptApi

def get_transcripts_with_rate_limit(video_ids, delay=1.0):
    """Fetch multiple transcripts with rate limiting."""
    transcripts = {}
    for video_id in video_ids:
        try:
            transcripts[video_id] = YouTubeTranscriptApi.get_transcript(video_id)
            print(f"✓ Got transcript for {video_id}")
        except Exception as e:
            print(f"✗ Failed for {video_id}: {e}")
            transcripts[video_id] = None
        time.sleep(delay)  # Wait between requests
    return transcripts
```
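A fixed delay helps, but for transient failures an exponential-backoff wrapper is more robust. This is a generic sketch; the with_backoff helper and its parameters are my own, not part of youtube-transcript-api:

```python
import random
import time

def with_backoff(fn, retries=4, base_delay=1.0):
    """Call fn(); on failure, retry with exponentially growing delays plus jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, re-raise the last error
            # Delay doubles each attempt, with jitter proportional to base_delay
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Usage with any fetch function, e.g.:
# transcript = with_backoff(lambda: YouTubeTranscriptApi.get_transcript(video_id))
```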
Challenge 2: Missing Transcripts
Not all videos have captions available.
```python
from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript_or_fallback(video_id):
    """Try to get a transcript, with fallback options."""
    # Try English first
    try:
        return YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
    except Exception:
        pass

    # Try any available transcript
    try:
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
        for transcript in transcript_list:
            return transcript.fetch()
    except Exception:
        pass

    # No transcript available - would need audio-based transcription
    return None
```
Challenge 3: Cleaning Up Transcript Text
Auto-generated transcripts often have issues like missing punctuation, incorrect capitalization, and timing artifacts.
```python
import re

def clean_transcript(transcript):
    """Clean and improve transcript text."""
    # Combine all text
    full_text = ' '.join(entry['text'] for entry in transcript)

    # Remove timing artifacts like [Music], [Applause]
    full_text = re.sub(r'\[.*?\]', '', full_text)

    # Collapse repeated whitespace
    full_text = re.sub(r'\s+', ' ', full_text)

    # Basic cleanup only - production code would use NLP
    # to restore punctuation and capitalization
    return full_text.strip()

def add_paragraphs(text, sentences_per_paragraph=4):
    """Split text into paragraphs."""
    # Split on sentence boundaries
    sentences = re.split(r'(?<=[.!?])\s+', text)
    paragraphs = []
    for i in range(0, len(sentences), sentences_per_paragraph):
        paragraphs.append(' '.join(sentences[i:i + sentences_per_paragraph]))
    return '\n\n'.join(paragraphs)
```
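To see the cleaning behaviour concretely, here is the same regex logic applied inline to a few fabricated entries (the sample data is made up for illustration):

```python
import re

# Hypothetical auto-generated entries with typical artifacts
transcript = [
    {'text': '[Music]', 'start': 0.0, 'duration': 2.0},
    {'text': 'hello everyone', 'start': 2.0, 'duration': 2.5},
    {'text': 'welcome  back', 'start': 4.5, 'duration': 2.0},
]

full_text = ' '.join(entry['text'] for entry in transcript)
full_text = re.sub(r'\[.*?\]', '', full_text)       # drop [Music], [Applause]
full_text = re.sub(r'\s+', ' ', full_text).strip()  # collapse whitespace

print(full_text)  # hello everyone welcome back
```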
Real-World Example: Building a Video Summarizer
Let's put it all together with a practical example—a video summarizer that extracts transcripts and generates summaries using an LLM.
```python
import re

from openai import OpenAI
from youtube_transcript_api import YouTubeTranscriptApi

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_video_id(url):
    """Extract the video ID from a YouTube URL."""
    patterns = [
        r'(?:v=|\/)([0-9A-Za-z_-]{11}).*',
        r'(?:embed\/)([0-9A-Za-z_-]{11})',
        r'(?:youtu\.be\/)([0-9A-Za-z_-]{11})',
    ]
    for pattern in patterns:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    return None

def get_transcript_text(video_id):
    """Get the transcript as plain text."""
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        return ' '.join(entry['text'] for entry in transcript)
    except Exception:
        return None

def summarize_text(text, max_length=500):
    """Summarize text using GPT."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that summarizes video transcripts concisely."
            },
            {
                "role": "user",
                "content": f"Please summarize this video transcript in about {max_length} characters:\n\n{text[:8000]}"
            }
        ],
        max_tokens=500
    )
    return response.choices[0].message.content

def summarize_video(youtube_url):
    """Complete pipeline: URL to summary."""
    video_id = extract_video_id(youtube_url)
    if not video_id:
        return {"error": "Invalid YouTube URL"}

    transcript = get_transcript_text(video_id)
    if not transcript:
        return {"error": "Could not fetch transcript"}

    summary = summarize_text(transcript)
    return {
        "video_id": video_id,
        "transcript_length": len(transcript),
        "summary": summary
    }

# Usage
result = summarize_video("https://www.youtube.com/watch?v=VIDEO_ID")
print(result['summary'])
```
The Challenges of Building Your Own Solution
While the methods above work, building a production-ready transcript extraction system comes with significant challenges:
Reliability Issues
- YouTube frequently updates their frontend, breaking scraping-based solutions
- Auto-generated captions aren't available for all videos
- Rate limiting can disrupt high-volume applications
Missing Features
- Basic libraries only give you sentence-level timestamps, not word-level precision
- No semantic paragraph detection—you get a wall of text
- No speaker identification for podcasts and interviews
- No fallback when captions don't exist
Maintenance Burden
- You'll spend time fixing breakages instead of building features
- Audio-based transcription requires GPU infrastructure
- Scaling brings additional complexity
Quality Concerns
- Auto-generated captions often have errors
- No built-in cleaning or formatting
- Inconsistent results across different video types
The Easier Alternative: YouTubeTranscript.dev API
If you'd rather focus on building your application than maintaining transcript infrastructure, YouTubeTranscript.dev provides everything you need through a simple API.
Why Use a Managed API?
It Just Works
- Never get "transcript not available" errors—audio-based fallback handles videos without captions
- Consistent, reliable results every time
- No maintenance when YouTube changes their systems
Advanced Features Out of the Box
- Word-level timestamps for precision applications
- Semantic paragraph segmentation—no more walls of text
- Speaker diarization for multi-speaker content
- Multiple output formats (JSON, SRT, VTT, plain text)
Simple Integration
```bash
curl "https://youtubetranscript.dev/api/transcript?video_id=VIDEO_ID" \
  -H "Authorization: Bearer YOUR_API_KEY"
```
```python
import requests

response = requests.get(
    "https://youtubetranscript.dev/api/transcript",
    params={"video_id": "dQw4w9WgXcQ"},
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
data = response.json()
print(data['transcript'])
```
Generous Free Tier
Start with 30 free credits per month—enough to test and prototype before committing. Paid plans start at just $9/month for 1,000 credits.
When to Build vs. Buy
Build your own if:
- You're learning and want to understand how it works
- You have very specific requirements no API meets
- Volume is extremely low (a few videos per month)
Use YouTubeTranscript.dev if:
- You need reliability for production applications
- You want word-level timestamps or speaker detection
- Videos sometimes don't have captions
- You'd rather ship features than fix infrastructure
Conclusion
Getting YouTube transcripts programmatically is achievable with open-source tools, but building a robust, production-ready solution requires handling many edge cases—missing captions, rate limits, formatting issues, and YouTube's ever-changing systems.
For hobby projects and learning, the methods in this guide will serve you well. For production applications where reliability matters, consider offloading the complexity to a purpose-built API like YouTubeTranscript.dev.
Whatever path you choose, transcripts unlock powerful possibilities for your applications. Happy building!
Found this guide helpful? Check out YouTubeTranscript.dev for a hassle-free transcript API with features like word-level timestamps, speaker detection, and audio-based fallback transcription.