agenthustler

How to Scrape YouTube in 2026: Videos, Channels, Comments, and Metadata

YouTube hosts over 800 million videos. Whether you're building a competitor analysis tool, tracking trends, or collecting training data — extracting YouTube data programmatically is a common need.

In this guide, I'll walk you through the practical ways to scrape YouTube in 2026: what data you can get, Python code examples, and when to use the official API vs web scraping.

What YouTube Data Can You Extract?

You can collect:

  • Video metadata: title, description, view count, likes, upload date, duration, tags
  • Channel info: subscriber count, total videos, channel description, creation date
  • Comments: text, author, likes, reply count, timestamps
  • Search results: videos matching keywords, filters by date/relevance/views
  • Playlists: video lists, playlist metadata
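
To make the shape of that data concrete, here's a sketch of how the video fields might be modeled in Python (the field names are my own; adapt them to whichever collection method below you use):

```python
from typing import List, TypedDict

class VideoRecord(TypedDict):
    """Illustrative schema for the video metadata fields listed above."""
    title: str
    description: str
    views: int
    likes: int
    upload_date: str   # ISO date string
    duration: int      # seconds
    tags: List[str]

# A populated example record
sample: VideoRecord = {
    'title': 'Example video',
    'description': '',
    'views': 12345,
    'likes': 678,
    'upload_date': '2026-01-15',
    'duration': 212,
    'tags': ['example'],
}
```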

Method 1: YouTube Data API v3

The official API is the cleanest option for structured data.

Setup

import requests

API_KEY = 'YOUR_YOUTUBE_API_KEY'
BASE_URL = 'https://www.googleapis.com/youtube/v3'

def get_video_details(video_id: str) -> dict:
    """Fetch video metadata via YouTube Data API."""
    params = {
        'part': 'snippet,statistics,contentDetails',
        'id': video_id,
        'key': API_KEY
    }
    response = requests.get(f'{BASE_URL}/videos', params=params, timeout=30)
    response.raise_for_status()
    items = response.json().get('items', [])
    if not items:
        return {}

    item = items[0]
    return {
        'title': item['snippet']['title'],
        'description': item['snippet']['description'],
        'views': int(item['statistics'].get('viewCount', 0)),
        'likes': int(item['statistics'].get('likeCount', 0)),
        'comments': int(item['statistics'].get('commentCount', 0)),
        'duration': item['contentDetails']['duration'],
        'published_at': item['snippet']['publishedAt'],
        'channel': item['snippet']['channelTitle'],
        'tags': item['snippet'].get('tags', []),
    }

video = get_video_details('dQw4w9WgXcQ')
print(f"{video['title']}: {video['views']:,} views")

Fetching Comments

def get_comments(video_id: str, max_results: int = 100) -> list:
    """Fetch top-level comments for a video."""
    comments = []
    params = {
        'part': 'snippet',
        'videoId': video_id,
        'maxResults': min(max_results, 100),
        'order': 'relevance',
        'key': API_KEY
    }

    while len(comments) < max_results:
        response = requests.get(
            f'{BASE_URL}/commentThreads', params=params, timeout=30
        )
        response.raise_for_status()
        data = response.json()

        for item in data.get('items', []):
            snippet = item['snippet']['topLevelComment']['snippet']
            comments.append({
                'author': snippet['authorDisplayName'],
                'text': snippet['textDisplay'],
                'likes': snippet['likeCount'],
                'published_at': snippet['publishedAt'],
            })

        next_page = data.get('nextPageToken')
        if not next_page:
            break
        params['pageToken'] = next_page

    return comments[:max_results]
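
get_comments only returns top-level comments. If you need replies too, the API exposes them through the comments.list endpoint, keyed by the parent comment's ID (which is the thread ID returned by commentThreads). A minimal sketch, reusing the API_KEY and BASE_URL from the setup above:

```python
API_KEY = 'YOUR_YOUTUBE_API_KEY'
BASE_URL = 'https://www.googleapis.com/youtube/v3'

def build_reply_params(comment_id: str, max_results: int = 50) -> dict:
    """Query parameters for the comments.list endpoint."""
    return {
        'part': 'snippet',
        'parentId': comment_id,
        'maxResults': min(max_results, 100),  # API caps a page at 100
        'key': API_KEY,
    }

def get_replies(comment_id: str, max_results: int = 50) -> list:
    """Fetch replies to a top-level comment."""
    import requests  # imported here so the param builder stays stdlib-only
    response = requests.get(
        f'{BASE_URL}/comments',
        params=build_reply_params(comment_id, max_results),
        timeout=30,
    )
    response.raise_for_status()
    return [
        {
            'author': item['snippet']['authorDisplayName'],
            'text': item['snippet']['textDisplay'],
            'likes': item['snippet']['likeCount'],
        }
        for item in response.json().get('items', [])
    ]
```

To use this, you'd also capture each thread's ID in get_comments (it's the top-level `id` field on each commentThreads item).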

API Limitations

The YouTube API's free tier gives you 10,000 quota units per day. Each search request costs 100 units, while a video details lookup costs 1 unit. For large-scale collection, you'll hit the limit fast.
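
A quick back-of-the-envelope calculation shows how fast the quota goes:

```python
DAILY_QUOTA = 10_000
SEARCH_COST = 100   # per search.list call
VIDEO_COST = 1      # per videos.list call

def max_searches(video_lookups: int) -> int:
    """How many searches fit in a day after paying for video lookups."""
    remaining = DAILY_QUOTA - video_lookups * VIDEO_COST
    return max(remaining, 0) // SEARCH_COST

print(max_searches(0))      # → 100 searches if you do nothing else
print(max_searches(5000))   # → 50 searches after 5,000 video lookups
```

One hundred searches a day is nothing for a trend tracker, which is why the scraping methods below exist.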

Method 2: Web Scraping with Python

When the API quota isn't enough, or you need data the API doesn't expose (like exact subscriber counts or revenue estimates), web scraping fills the gap.

Basic Approach with yt-dlp

yt-dlp is the most reliable tool for extracting YouTube metadata without API keys:

import subprocess
import json

def scrape_video_metadata(url: str) -> dict:
    """Extract video metadata using yt-dlp."""
    cmd = [
        'yt-dlp',
        '--dump-json',
        '--no-download',
        '--no-warnings',
        url
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f'yt-dlp failed: {result.stderr}')

    data = json.loads(result.stdout)
    return {
        'title': data.get('title'),
        'description': data.get('description'),
        'views': data.get('view_count'),
        'likes': data.get('like_count'),
        'duration': data.get('duration'),
        'upload_date': data.get('upload_date'),
        'channel': data.get('channel'),
        'channel_subscribers': data.get('channel_follower_count'),
        'tags': data.get('tags', []),
        'categories': data.get('categories', []),
        'thumbnail': data.get('thumbnail'),
    }

video = scrape_video_metadata('https://www.youtube.com/watch?v=dQw4w9WgXcQ')
print(json.dumps(video, indent=2))
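
If you'd rather skip the subprocess round-trip, yt-dlp also ships a Python API; here's a minimal sketch returning a subset of the fields above:

```python
def pick_fields(info: dict) -> dict:
    """Reduce a yt-dlp info dict to the fields used in this guide."""
    return {
        'title': info.get('title'),
        'views': info.get('view_count'),
        'likes': info.get('like_count'),
        'channel': info.get('channel'),
    }

def scrape_with_python_api(url: str) -> dict:
    """Extract metadata via yt-dlp's Python API instead of the CLI."""
    # imported here so pick_fields stays usable without yt-dlp installed
    from yt_dlp import YoutubeDL
    opts = {'quiet': True, 'no_warnings': True, 'skip_download': True}
    with YoutubeDL(opts) as ydl:
        return pick_fields(ydl.extract_info(url, download=False))
```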

Scraping Channel Videos

def scrape_channel_videos(channel_url: str, max_videos: int = 50) -> list:
    """Get metadata for up to max_videos from a channel's uploads."""
    cmd = [
        'yt-dlp',
        '--dump-json',
        '--no-download',
        '--flat-playlist',
        '--playlist-end', str(max_videos),
        f'{channel_url}/videos'
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f'yt-dlp failed: {result.stderr}')

    videos = []
    for line in result.stdout.strip().split('\n'):
        if line:
            data = json.loads(line)
            videos.append({
                'id': data.get('id'),
                'title': data.get('title'),
                'views': data.get('view_count'),
                'duration': data.get('duration'),
                'url': data.get('url'),
            })

    return videos
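
yt-dlp can also scrape search results directly through its ytsearchN: URL prefix, with no API key needed. A sketch along the same lines as the functions above:

```python
import json
import subprocess

def build_search_cmd(query: str, limit: int = 10) -> list:
    """yt-dlp command using the ytsearchN: prefix to scrape search results."""
    return [
        'yt-dlp',
        '--dump-json',
        '--no-download',
        '--flat-playlist',
        f'ytsearch{limit}:{query}',
    ]

def scrape_search_results(query: str, limit: int = 10) -> list:
    """Return basic metadata for the top search results for a query."""
    result = subprocess.run(build_search_cmd(query, limit),
                            capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f'yt-dlp failed: {result.stderr}')
    return [
        {'id': d.get('id'), 'title': d.get('title'), 'url': d.get('url')}
        for d in (json.loads(line) for line in result.stdout.splitlines() if line)
    ]
```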

Method 3: Using a Managed Scraper

For production workloads where you need reliability and scale, a managed scraping solution saves you from maintaining infrastructure.

YouTube Scraper on Apify handles the heavy lifting — it extracts video metadata, channel info, comments, and search results with built-in proxy rotation and retry logic. You just provide the URLs and get structured JSON back.
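
As a rough sketch of what that workflow looks like with the apify-client Python package (the actor ID and input field names below are placeholders; check the actor's own documentation for its real input schema):

```python
def build_run_input(video_urls: list) -> dict:
    """Assemble the actor input. Field names here are illustrative placeholders."""
    return {'startUrls': [{'url': u} for u in video_urls]}

def run_youtube_scraper(token: str, video_urls: list) -> list:
    """Run a hosted YouTube scraper actor and return its dataset items."""
    # imported here so the input builder above has no dependencies
    from apify_client import ApifyClient
    client = ApifyClient(token)
    # 'someuser/youtube-scraper' is a placeholder actor ID
    run = client.actor('someuser/youtube-scraper').call(
        run_input=build_run_input(video_urls)
    )
    return list(client.dataset(run['defaultDatasetId']).iterate_items())
```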

Handling Anti-Scraping Measures

YouTube actively blocks automated requests. Here's what works in 2026:

Proxy Rotation

Using residential proxies is essential for any volume:

import requests

# Using ThorData residential proxies
# Sign up: https://affiliate.thordata.com/0a0x4nzu7tvv
proxies = {
    'http': 'http://user:pass@proxy.thordata.com:9090',
    'https': 'http://user:pass@proxy.thordata.com:9090',
}

response = requests.get(
    'https://www.youtube.com/watch?v=dQw4w9WgXcQ',
    proxies=proxies,
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
)

Alternatively, ScraperAPI handles proxy rotation and CAPTCHA solving automatically:

import requests

SCRAPERAPI_KEY = 'YOUR_KEY'
url = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ'

response = requests.get(
    'http://api.scraperapi.com',
    params={'api_key': SCRAPERAPI_KEY, 'url': url}
)
html = response.text

Rate Limiting

import time
import random

def polite_request(url: str, session: requests.Session) -> requests.Response:
    """Make a request with random delay to avoid detection."""
    time.sleep(random.uniform(2, 5))
    return session.get(url)
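
Fixed delays help, but when YouTube does throttle you (HTTP 429s or transient errors), exponential backoff recovers more gracefully. A generic sketch you can wrap around any fetch function:

```python
import random
import time

def backoff_delay(attempt: int) -> float:
    """Exponential backoff with jitter: ~1s, ~2s, ~4s, ~8s, ..."""
    return 2 ** attempt + random.uniform(0, 1)

def with_backoff(fetch, max_retries: int = 4):
    """Call fetch() until it succeeds, sleeping longer after each failure."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(backoff_delay(attempt))
```

For example: `with_backoff(lambda: polite_request(url, session))`.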

Storing the Data

For any serious scraping project, dump results to a structured format:

import csv
import json

def save_to_csv(videos: list, filename: str = 'youtube_data.csv'):
    """Save scraped video data to CSV."""
    if not videos:
        return

    keys = videos[0].keys()
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(videos)

def save_to_json(videos: list, filename: str = 'youtube_data.json'):
    """Save scraped video data to JSON."""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(videos, f, indent=2, ensure_ascii=False)
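
For long-running jobs, CSV and JSON get awkward because re-runs create duplicates. A sketch of an SQLite upsert keyed on video ID (schema trimmed to three columns for illustration):

```python
import sqlite3

def save_to_sqlite(videos: list, db_path: str = 'youtube_data.db') -> None:
    """Upsert scraped videos keyed on video id, so re-runs deduplicate."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS videos (
            id TEXT PRIMARY KEY,
            title TEXT,
            views INTEGER
        )
    """)
    conn.executemany(
        "INSERT INTO videos (id, title, views) VALUES (:id, :title, :views) "
        "ON CONFLICT(id) DO UPDATE SET title = excluded.title, "
        "views = excluded.views",
        videos,
    )
    conn.commit()
    conn.close()
```

Re-scraping the same video simply refreshes its row, which also gives you a cheap way to track view counts over time if you add a timestamp column.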

YouTube API vs Scraping: When to Use What

Factor       | YouTube API           | Web Scraping
------------ | --------------------- | -------------------------
Quota        | 10K units/day         | Unlimited (with proxies)
Data quality | Structured JSON       | Requires parsing
Setup        | API key required      | No auth needed
Cost         | Free (within quota)   | Proxy costs
Reliability  | High                  | Breaks with site changes
Best for     | Small-medium projects | Large-scale collection

Legal Considerations

YouTube's ToS restricts automated access. The YouTube API has its own terms of service. For web scraping:

  • Respect robots.txt
  • Don't overload servers
  • Use data responsibly
  • Check local laws (GDPR, CCPA apply to personal data in comments)
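
The first point is easy to automate with the standard library's urllib.robotparser. A small sketch (the rules string here is a simplified stand-in; fetch the site's real robots.txt in practice):

```python
from urllib.robotparser import RobotFileParser

def allowed_by(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules (pass the file's text in)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Illustrative rules only, not YouTube's actual robots.txt
rules = "User-agent: *\nDisallow: /results\n"
print(allowed_by(rules, 'my-scraper', 'https://www.youtube.com/results?search_query=x'))
print(allowed_by(rules, 'my-scraper', 'https://www.youtube.com/watch?v=abc'))
```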

Wrapping Up

The best approach depends on your scale:

  1. Under 10K quota units/day: Use the YouTube Data API
  2. Moderate scale: Use yt-dlp with rate limiting
  3. Production scale: Use a managed YouTube scraper with built-in proxy rotation
  4. Custom needs: Build your own scraper with residential proxies

Pick the method that matches your volume and reliability needs, and always be respectful of the platform's resources.
