DEV Community

agenthustler

How to Scrape YouTube in 2026: Videos, Channels, Comments, and Transcripts

YouTube holds an enormous amount of publicly available data — video metadata, channel statistics, comments, and transcripts. Whether you're building a content research tool, training an NLP model, or doing competitive analysis, extracting this data programmatically is a valuable skill.

In this guide, I'll walk through every major approach to scraping YouTube in 2026, including the official API, popular Python libraries, and practical workarounds for the limitations you'll hit.

The YouTube Data API v3: Start Here

Google's YouTube Data API v3 is the most reliable way to access video metadata, channel info, and comments. It's free to use with a Google Cloud project, but comes with strict quota limits.

Setting Up

from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"
youtube = build("youtube", "v3", developerKey=API_KEY)

You'll need to create a project in the Google Cloud Console, enable the YouTube Data API v3, and generate an API key.
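One practical note: rather than hardcoding the key as in the snippet above, read it from an environment variable so it never lands in version control. A minimal helper sketch (the YOUTUBE_API_KEY variable name is my own choice):

```python
import os

def load_api_key(var="YOUTUBE_API_KEY"):
    """Read the API key from the environment instead of hardcoding it."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable")
    return key
```

Then pass load_api_key() to build() instead of a literal string.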

Searching for Videos

def search_videos(query, max_results=50):
    videos = []
    request = youtube.search().list(
        q=query,
        part="snippet",
        type="video",
        maxResults=min(max_results, 50),
        order="relevance"
    )
    response = request.execute()

    for item in response["items"]:
        videos.append({
            "video_id": item["id"]["videoId"],
            "title": item["snippet"]["title"],
            "channel": item["snippet"]["channelTitle"],
            "published_at": item["snippet"]["publishedAt"],
            "description": item["snippet"]["description"]
        })
    return videos

results = search_videos("python web scraping tutorial")

Getting Detailed Video Statistics

The search endpoint doesn't return view counts or likes. You need a separate videos.list call:

def get_video_details(video_ids):
    # videos.list accepts at most 50 IDs per call; chunk longer lists
    request = youtube.videos().list(
        part="statistics,contentDetails,snippet",
        id=",".join(video_ids)
    )
    response = request.execute()

    details = []
    for item in response["items"]:
        details.append({
            "video_id": item["id"],
            "title": item["snippet"]["title"],
            "views": int(item["statistics"].get("viewCount", 0)),
            "likes": int(item["statistics"].get("likeCount", 0)),
            "comments": int(item["statistics"].get("commentCount", 0)),
            "duration": item["contentDetails"]["duration"]
        })
    return details
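The duration field comes back as an ISO 8601 string like PT4M13S, which is awkward for sorting or filtering. Here's a small parser for the common PT...H...M...S form (note it's a sketch: durations with a day component, like P1DT2H, would need extra handling):

```python
import re

def parse_iso_duration(duration):
    """Convert an ISO 8601 duration like 'PT1H2M3S' to total seconds."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if not match:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds
```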

The Quota Problem

Here's where it gets painful. The YouTube API gives you 10,000 quota units per day by default. Different operations cost different amounts:

Operation             Cost
search.list           100 units
videos.list           1 unit
channels.list         1 unit
commentThreads.list   1 unit

A single search.list call burns 100 units — so you can only do 100 searches per day. That's a hard ceiling for any serious project.

Workarounds:

  • Use videos.list with known video IDs instead of search.list when possible (1 unit vs 100)
  • Cache results aggressively
  • Request a quota increase (Google sometimes grants 50K-100K for legitimate projects)
  • Use multiple API keys across different Google Cloud projects
  • Supplement the API with other extraction methods (see below)
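To make the budgeting concrete, here's a rough back-of-the-envelope calculator for the default 10,000-unit quota. It assumes a simple pipeline where each search page (up to 50 results) is followed by one videos.list batch for details:

```python
def daily_search_capacity(daily_quota=10_000, search_cost=100, details_cost=1):
    """Estimate how many search pages fit in a day's quota.

    Assumes each search.list page (up to 50 results) is paired with
    one videos.list call to fetch statistics for those 50 IDs.
    """
    cost_per_batch = search_cost + details_cost
    batches = daily_quota // cost_per_batch
    return {"search_pages": batches, "videos_with_details": batches * 50}
```

With the defaults, that works out to 99 search pages, or roughly 4,950 videos with full statistics per day.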

yt-dlp: The Swiss Army Knife

yt-dlp is the community-maintained fork of youtube-dl, and it's incredibly powerful for metadata extraction — not just downloading.

import yt_dlp

def get_video_info(url):
    ydl_opts = {
        "quiet": True,
        "skip_download": True,
        "extract_flat": False
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)

    return {
        "title": info.get("title"),
        "views": info.get("view_count"),
        "likes": info.get("like_count"),
        "duration": info.get("duration"),
        "upload_date": info.get("upload_date"),
        "channel": info.get("channel"),
        "subscriber_count": info.get("channel_follower_count"),
        "description": info.get("description"),
        "tags": info.get("tags"),
        "categories": info.get("categories")
    }

video = get_video_info("https://www.youtube.com/watch?v=dQw4w9WgXcQ")

The key advantage: no API quota limits. yt-dlp works by parsing YouTube's web pages and internal APIs directly.

Scraping Entire Channels

def scrape_channel_videos(channel_url):
    ydl_opts = {
        "quiet": True,
        "skip_download": True,
        "extract_flat": True,
        "playlistend": 100
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(channel_url, download=False)

    videos = []
    for entry in info.get("entries", []):
        videos.append({
            "id": entry.get("id"),
            "title": entry.get("title"),
            "url": entry.get("url")
        })
    return videos

# point at the /videos tab so entries are individual videos, not channel tabs
channel_videos = scrape_channel_videos("https://www.youtube.com/@mkbhd/videos")

Important note: yt-dlp gets updated frequently to keep up with YouTube's changes. Always install the latest version:

pip install -U yt-dlp

Extracting Transcripts

Transcripts are gold for NLP work, content analysis, and building search indexes. The youtube-transcript-api library makes this straightforward. One caveat: the library moved to an instance-based interface in v1.0, so the static-method style below applies to pre-1.0 releases. Check which version you have installed.

from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound

def get_transcript(video_id, language="en"):
    try:
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

        # Prefer a human-written transcript; fall back to the auto-generated one
        try:
            transcript = transcript_list.find_manually_created_transcript([language])
        except NoTranscriptFound:
            transcript = transcript_list.find_generated_transcript([language])

        data = transcript.fetch()
        full_text = " ".join(entry["text"] for entry in data)

        return {
            "text": full_text,
            "segments": data,
            "language": language,
            "is_generated": transcript.is_generated
        }
    except Exception as e:
        return {"error": str(e)}

result = get_transcript("dQw4w9WgXcQ")
print(result["text"][:500])

Batch Transcript Extraction

For large-scale transcript collection:

import time
import json

def batch_transcripts(video_ids, output_file="transcripts.jsonl"):
    results = []

    for i, vid in enumerate(video_ids):
        print(f"Processing {i+1}/{len(video_ids)}: {vid}")
        transcript = get_transcript(vid)
        transcript["video_id"] = vid
        results.append(transcript)

        with open(output_file, "a") as f:
            f.write(json.dumps(transcript) + "\n")

        time.sleep(1)

    success = sum(1 for r in results if "error" not in r)
    print(f"Done: {success}/{len(video_ids)} transcripts extracted")
    return results

Not every video has transcripts available. Auto-generated transcripts cover most English-language content, but you'll see gaps with music, very short clips, and some older videos.
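When transcripts are available, a common next step is turning the raw segments into timestamped lines for a search index or summary. A small helper sketch, assuming the old-style list-of-dicts segments shown above (each with a text field and a start offset in seconds):

```python
def to_timestamped_text(segments):
    """Render transcript segments as '[MM:SS] text' lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text']}")
    return "\n".join(lines)
```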

Scraping Comments

Comments are useful for sentiment analysis, finding common questions, and understanding audience engagement.

Using the Official API

def get_comments(video_id, max_results=100):
    comments = []
    request = youtube.commentThreads().list(
        part="snippet",
        videoId=video_id,
        maxResults=min(max_results, 100),
        order="relevance",
        textFormat="plainText"
    )

    while request and len(comments) < max_results:
        response = request.execute()

        for item in response["items"]:
            comment = item["snippet"]["topLevelComment"]["snippet"]
            comments.append({
                "author": comment["authorDisplayName"],
                "text": comment["textDisplay"],
                "likes": comment["likeCount"],
                "published_at": comment["publishedAt"]
            })

        request = youtube.commentThreads().list_next(request, response)

    return comments[:max_results]
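Once collected, a quick way to surface the most engaged comments is to sort by like count. A small, hypothetical helper over the dicts that get_comments() returns:

```python
def top_comments(comments, n=3):
    """Return the n most-liked comments from a scraped batch."""
    return sorted(comments, key=lambda c: c.get("likes", 0), reverse=True)[:n]
```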

Using yt-dlp for Comments (No Quota)

yt-dlp can also extract comments without API quota costs:

def get_comments_ytdlp(video_url, max_comments=200):
    ydl_opts = {
        "quiet": True,
        "skip_download": True,
        "getcomments": True,
        "extractor_args": {
            "youtube": {
                "max_comments": [str(max_comments)]
            }
        }
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(video_url, download=False)

    comments = []
    for c in info.get("comments", []):
        comments.append({
            "author": c.get("author"),
            "text": c.get("text"),
            "likes": c.get("like_count"),
            "timestamp": c.get("timestamp")
        })
    return comments

This is slower than the API but has no daily limits.

Channel Analytics Scraping

For competitive analysis, you often need channel-level data:

def get_channel_stats(channel_id):
    request = youtube.channels().list(
        part="statistics,snippet,contentDetails,brandingSettings",
        id=channel_id
    )
    response = request.execute()

    if not response["items"]:
        return None

    channel = response["items"][0]
    return {
        "name": channel["snippet"]["title"],
        "description": channel["snippet"]["description"],
        "subscribers": int(channel["statistics"]["subscriberCount"]),
        "total_views": int(channel["statistics"]["viewCount"]),
        "video_count": int(channel["statistics"]["videoCount"]),
        "created_at": channel["snippet"]["publishedAt"],
        "uploads_playlist": channel["contentDetails"]["relatedPlaylists"]["uploads"]
    }
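With those numbers in hand, you can derive simple benchmark ratios for comparing channels. A hypothetical helper over the dict shape returned above:

```python
def channel_ratios(stats):
    """Compute rough engagement ratios from channel-level statistics.

    Expects a dict with total_views, video_count, and subscribers keys,
    like the one get_channel_stats() builds.
    """
    videos = max(stats["video_count"], 1)
    subscribers = max(stats["subscribers"], 1)
    return {
        "avg_views_per_video": stats["total_views"] / videos,
        "views_per_subscriber": stats["total_views"] / subscribers,
    }
```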

Scaling Up: When Scripts Aren't Enough

The approaches above work well for small to medium projects. But if you need to scrape thousands of videos regularly, you'll run into challenges:

  • Rate limiting: YouTube will temporarily block IPs that send too many requests
  • Maintenance: YouTube changes its internal APIs frequently, breaking scrapers
  • Infrastructure: Running scrapers 24/7 requires proxy management and monitoring

For production-scale YouTube scraping, managed solutions save significant engineering time. I built a YouTube Scraper on Apify that handles all of this — proxies, retries, anti-bot detection — so you can focus on what you do with the data instead of maintaining scraper infrastructure.

Best Practices

  1. Respect rate limits. Whether using the API or scraping directly, add delays between requests. YouTube will temporarily block aggressive scrapers.

  2. Cache everything. Video metadata doesn't change frequently. Store results in a database and only re-fetch what you need.
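As a concrete sketch of that advice, here's a minimal SQLite-backed cache with a time-to-live (the table name and schema are my own choices):

```python
import json
import sqlite3
import time

class MetadataCache:
    """Cache video metadata dicts in SQLite, expiring entries after a TTL."""

    def __init__(self, path=":memory:", ttl=86400):
        self.ttl = ttl
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(video_id TEXT PRIMARY KEY, fetched_at REAL, data TEXT)"
        )

    def get(self, video_id):
        # Return the cached dict, or None if missing or expired
        row = self.db.execute(
            "SELECT fetched_at, data FROM cache WHERE video_id = ?", (video_id,)
        ).fetchone()
        if row and time.time() - row[0] < self.ttl:
            return json.loads(row[1])
        return None

    def put(self, video_id, data):
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (video_id, time.time(), json.dumps(data)),
        )
        self.db.commit()
```

Check the cache before every fetch, and only hit the API or yt-dlp on a miss.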

  3. Use the right tool for the job. Official API for structured queries with low volume. yt-dlp for metadata extraction at scale. youtube-transcript-api for transcripts.

  4. Handle errors gracefully. Videos get deleted, channels go private, transcripts aren't available. Your code should handle all of these.

  5. Stay current. YouTube regularly updates its anti-bot measures. Keep yt-dlp updated and monitor your scraper's success rate.

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("yt_scraper")

def scrape_with_monitoring(video_ids):
    success, failed = 0, 0

    for vid in video_ids:
        try:
            data = get_video_info(f"https://youtube.com/watch?v={vid}")
            success += 1
        except Exception as e:
            logger.warning(f"Failed {vid}: {e}")
            failed += 1

    total = success + failed
    rate = success / total * 100 if total else 0.0
    logger.info(f"Success rate: {rate:.1f}% ({success}/{total})")

    if rate < 90:
        logger.warning("Success rate below 90% — check for blocks or API changes")

Conclusion

YouTube scraping in 2026 comes down to combining the right tools: the official API for structured, quota-limited access; yt-dlp for flexible metadata extraction; and specialized libraries for transcripts and comments. Start with the simplest approach that meets your needs, and scale up from there.

For production workloads where you don't want to deal with proxy rotation and maintenance, check out the YouTube Scraper on Apify — it handles the infrastructure so you can focus on the data.

Happy scraping, and remember to be respectful of rate limits and terms of service.


Have questions or want to share your YouTube scraping setup? Drop a comment below.
