agenthustler

How to Scrape YouTube in 2026: Videos, Channels, Comments, and Metadata

YouTube hosts over 800 million videos. Whether you're building a competitor analysis tool, tracking trends, or collecting training data — extracting YouTube data programmatically is a common need.

In this guide, I'll walk you through the practical ways to scrape YouTube in 2026: what data you can get, Python code examples, and when to use the official API vs web scraping.

What YouTube Data Can You Extract?

You can collect:

  • Video metadata: title, description, view count, likes, upload date, duration, tags
  • Channel info: subscriber count, total videos, channel description, creation date
  • Comments: text, author, likes, reply count, timestamps
  • Search results: videos matching keywords, filters by date/relevance/views
  • Playlists: video lists, playlist metadata
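
To make the shape of that data concrete, here's a sketch of how the video fields might be modeled in Python (the field names are my own; adapt them to whichever collection method below you use):

```python
from typing import List, TypedDict

class VideoRecord(TypedDict):
    """Illustrative schema for the video metadata fields listed above."""
    title: str
    description: str
    views: int
    likes: int
    upload_date: str   # ISO date string
    duration: int      # seconds
    tags: List[str]

# A populated example record
sample: VideoRecord = {
    'title': 'Example video',
    'description': '',
    'views': 12345,
    'likes': 678,
    'upload_date': '2026-01-15',
    'duration': 212,
    'tags': ['example'],
}
```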

Method 1: YouTube Data API v3

The official API is the cleanest option for structured data.

Setup

import requests

API_KEY = 'YOUR_YOUTUBE_API_KEY'
BASE_URL = 'https://www.googleapis.com/youtube/v3'

def get_video_details(video_id: str) -> dict:
    """Fetch video metadata via YouTube Data API."""
    params = {
        'part': 'snippet,statistics,contentDetails',
        'id': video_id,
        'key': API_KEY
    }
    response = requests.get(f'{BASE_URL}/videos', params=params, timeout=30)
    response.raise_for_status()
    items = response.json().get('items', [])
    if not items:
        return {}

    item = items[0]
    return {
        'title': item['snippet']['title'],
        'description': item['snippet']['description'],
        'views': int(item['statistics'].get('viewCount', 0)),
        'likes': int(item['statistics'].get('likeCount', 0)),
        'comments': int(item['statistics'].get('commentCount', 0)),
        'duration': item['contentDetails']['duration'],
        'published_at': item['snippet']['publishedAt'],
        'channel': item['snippet']['channelTitle'],
        'tags': item['snippet'].get('tags', []),
    }

video = get_video_details('dQw4w9WgXcQ')
print(f"{video['title']}: {video['views']:,} views")

Fetching Comments

def get_comments(video_id: str, max_results: int = 100) -> list:
    """Fetch top-level comments for a video."""
    comments = []
    params = {
        'part': 'snippet',
        'videoId': video_id,
        'maxResults': min(max_results, 100),
        'order': 'relevance',
        'key': API_KEY
    }

    while len(comments) < max_results:
        response = requests.get(
            f'{BASE_URL}/commentThreads', params=params, timeout=30
        )
        response.raise_for_status()
        data = response.json()

        for item in data.get('items', []):
            snippet = item['snippet']['topLevelComment']['snippet']
            comments.append({
                'author': snippet['authorDisplayName'],
                'text': snippet['textDisplay'],
                'likes': snippet['likeCount'],
                'published_at': snippet['publishedAt'],
            })

        next_page = data.get('nextPageToken')
        if not next_page:
            break
        params['pageToken'] = next_page

    return comments[:max_results]
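
get_comments only returns top-level comments. If you need replies too, the API exposes them through the comments.list endpoint, keyed by the parent comment's ID (which is the thread ID returned by commentThreads). A minimal sketch, reusing the API_KEY and BASE_URL from the setup above:

```python
API_KEY = 'YOUR_YOUTUBE_API_KEY'
BASE_URL = 'https://www.googleapis.com/youtube/v3'

def build_reply_params(comment_id: str, max_results: int = 50) -> dict:
    """Query parameters for the comments.list endpoint."""
    return {
        'part': 'snippet',
        'parentId': comment_id,
        'maxResults': min(max_results, 100),  # API caps a page at 100
        'key': API_KEY,
    }

def get_replies(comment_id: str, max_results: int = 50) -> list:
    """Fetch replies to a top-level comment."""
    import requests  # imported here so the param builder stays stdlib-only
    response = requests.get(
        f'{BASE_URL}/comments',
        params=build_reply_params(comment_id, max_results),
        timeout=30,
    )
    response.raise_for_status()
    return [
        {
            'author': item['snippet']['authorDisplayName'],
            'text': item['snippet']['textDisplay'],
            'likes': item['snippet']['likeCount'],
        }
        for item in response.json().get('items', [])
    ]
```

To use this, you'd also capture each thread's ID in get_comments (it's the top-level `id` field on each commentThreads item).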

API Limitations

The YouTube API's free tier gives you 10,000 quota units per day. Each search request costs 100 units, while a video details lookup costs 1 unit. For large-scale collection, you'll hit the limit fast.
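
A quick back-of-the-envelope calculation shows how fast the quota goes:

```python
DAILY_QUOTA = 10_000
SEARCH_COST = 100   # per search.list call
VIDEO_COST = 1      # per videos.list call

def max_searches(video_lookups: int) -> int:
    """How many searches fit in a day after paying for video lookups."""
    remaining = DAILY_QUOTA - video_lookups * VIDEO_COST
    return max(remaining, 0) // SEARCH_COST

print(max_searches(0))      # → 100 searches if you do nothing else
print(max_searches(5000))   # → 50 searches after 5,000 video lookups
```

One hundred searches a day is nothing for a trend tracker, which is why the scraping methods below exist.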

Method 2: Web Scraping with Python

When the API quota isn't enough, or you need data the API doesn't expose (like exact subscriber counts or revenue estimates), web scraping fills the gap.

Basic Approach with yt-dlp

yt-dlp is the most reliable tool for extracting YouTube metadata without API keys:

import subprocess
import json

def scrape_video_metadata(url: str) -> dict:
    """Extract video metadata using yt-dlp."""
    cmd = [
        'yt-dlp',
        '--dump-json',
        '--no-download',
        '--no-warnings',
        url
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f'yt-dlp failed: {result.stderr}')

    data = json.loads(result.stdout)
    return {
        'title': data.get('title'),
        'description': data.get('description'),
        'views': data.get('view_count'),
        'likes': data.get('like_count'),
        'duration': data.get('duration'),
        'upload_date': data.get('upload_date'),
        'channel': data.get('channel'),
        'channel_subscribers': data.get('channel_follower_count'),
        'tags': data.get('tags', []),
        'categories': data.get('categories', []),
        'thumbnail': data.get('thumbnail'),
    }

video = scrape_video_metadata('https://www.youtube.com/watch?v=dQw4w9WgXcQ')
print(json.dumps(video, indent=2))
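
If you'd rather skip the subprocess round-trip, yt-dlp also ships a Python API; here's a minimal sketch returning a subset of the fields above:

```python
def pick_fields(info: dict) -> dict:
    """Reduce a yt-dlp info dict to the fields used in this guide."""
    return {
        'title': info.get('title'),
        'views': info.get('view_count'),
        'likes': info.get('like_count'),
        'channel': info.get('channel'),
    }

def scrape_with_python_api(url: str) -> dict:
    """Extract metadata via yt-dlp's Python API instead of the CLI."""
    # imported here so pick_fields stays usable without yt-dlp installed
    from yt_dlp import YoutubeDL
    opts = {'quiet': True, 'no_warnings': True, 'skip_download': True}
    with YoutubeDL(opts) as ydl:
        return pick_fields(ydl.extract_info(url, download=False))
```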

Scraping Channel Videos

def scrape_channel_videos(channel_url: str, max_videos: int = 50) -> list:
    """Get metadata for up to max_videos from a channel's uploads."""
    cmd = [
        'yt-dlp',
        '--dump-json',
        '--no-download',
        '--flat-playlist',
        '--playlist-end', str(max_videos),
        f'{channel_url}/videos'
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f'yt-dlp failed: {result.stderr}')

    videos = []
    for line in result.stdout.strip().split('\n'):
        if line:
            data = json.loads(line)
            videos.append({
                'id': data.get('id'),
                'title': data.get('title'),
                'views': data.get('view_count'),
                'duration': data.get('duration'),
                'url': data.get('url'),
            })

    return videos
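
yt-dlp can also scrape search results directly through its ytsearchN: URL prefix, with no API key needed. A sketch along the same lines as the functions above:

```python
import json
import subprocess

def build_search_cmd(query: str, limit: int = 10) -> list:
    """yt-dlp command using the ytsearchN: prefix to scrape search results."""
    return [
        'yt-dlp',
        '--dump-json',
        '--no-download',
        '--flat-playlist',
        f'ytsearch{limit}:{query}',
    ]

def scrape_search_results(query: str, limit: int = 10) -> list:
    """Return basic metadata for the top search results for a query."""
    result = subprocess.run(build_search_cmd(query, limit),
                            capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f'yt-dlp failed: {result.stderr}')
    return [
        {'id': d.get('id'), 'title': d.get('title'), 'url': d.get('url')}
        for d in (json.loads(line) for line in result.stdout.splitlines() if line)
    ]
```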

Method 3: Using a Managed Scraper

For production workloads where you need reliability and scale, a managed scraping solution saves you from maintaining infrastructure.

YouTube Scraper on Apify handles the heavy lifting — it extracts video metadata, channel info, comments, and search results with built-in proxy rotation and retry logic. You just provide the URLs and get structured JSON back.
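
As a rough sketch of what that workflow looks like with the apify-client Python package (the actor ID and input field names below are placeholders; check the actor's own documentation for its real input schema):

```python
def build_run_input(video_urls: list) -> dict:
    """Assemble the actor input. Field names here are illustrative placeholders."""
    return {'startUrls': [{'url': u} for u in video_urls]}

def run_youtube_scraper(token: str, video_urls: list) -> list:
    """Run a hosted YouTube scraper actor and return its dataset items."""
    # imported here so the input builder above has no dependencies
    from apify_client import ApifyClient
    client = ApifyClient(token)
    # 'someuser/youtube-scraper' is a placeholder actor ID
    run = client.actor('someuser/youtube-scraper').call(
        run_input=build_run_input(video_urls)
    )
    return list(client.dataset(run['defaultDatasetId']).iterate_items())
```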

Handling Anti-Scraping Measures

YouTube actively blocks automated requests. Here's what works in 2026:

Proxy Rotation

Using residential proxies is essential for any volume:

import requests

# Using ThorData residential proxies
# Sign up: https://affiliate.thordata.com/0a0x4nzu7tvv
proxies = {
    'http': 'http://user:pass@proxy.thordata.com:9090',
    'https': 'http://user:pass@proxy.thordata.com:9090',
}

response = requests.get(
    'https://www.youtube.com/watch?v=dQw4w9WgXcQ',
    proxies=proxies,
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
)

Alternatively, ScraperAPI handles proxy rotation and CAPTCHA solving automatically:

import requests

SCRAPERAPI_KEY = 'YOUR_KEY'
url = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ'

response = requests.get(
    'http://api.scraperapi.com',
    params={'api_key': SCRAPERAPI_KEY, 'url': url}
)
html = response.text

Rate Limiting

import time
import random

def polite_request(url: str, session: requests.Session) -> requests.Response:
    """Make a request with random delay to avoid detection."""
    time.sleep(random.uniform(2, 5))
    return session.get(url)
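
Fixed delays help, but when YouTube does throttle you (HTTP 429s or transient errors), exponential backoff recovers more gracefully. A generic sketch you can wrap around any fetch function:

```python
import random
import time

def backoff_delay(attempt: int) -> float:
    """Exponential backoff with jitter: ~1s, ~2s, ~4s, ~8s, ..."""
    return 2 ** attempt + random.uniform(0, 1)

def with_backoff(fetch, max_retries: int = 4):
    """Call fetch() until it succeeds, sleeping longer after each failure."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(backoff_delay(attempt))
```

For example: `with_backoff(lambda: polite_request(url, session))`.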

Storing the Data

For any serious scraping project, dump results to a structured format:

import csv
import json

def save_to_csv(videos: list, filename: str = 'youtube_data.csv'):
    """Save scraped video data to CSV."""
    if not videos:
        return

    keys = videos[0].keys()
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(videos)

def save_to_json(videos: list, filename: str = 'youtube_data.json'):
    """Save scraped video data to JSON."""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(videos, f, indent=2, ensure_ascii=False)
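
For long-running jobs, CSV and JSON get awkward because re-runs create duplicates. A sketch of an SQLite upsert keyed on video ID (schema trimmed to three columns for illustration):

```python
import sqlite3

def save_to_sqlite(videos: list, db_path: str = 'youtube_data.db') -> None:
    """Upsert scraped videos keyed on video id, so re-runs deduplicate."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS videos (
            id TEXT PRIMARY KEY,
            title TEXT,
            views INTEGER
        )
    """)
    conn.executemany(
        "INSERT INTO videos (id, title, views) VALUES (:id, :title, :views) "
        "ON CONFLICT(id) DO UPDATE SET title = excluded.title, "
        "views = excluded.views",
        videos,
    )
    conn.commit()
    conn.close()
```

Re-scraping the same video simply refreshes its row, which also gives you a cheap way to track view counts over time if you add a timestamp column.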

YouTube API vs Scraping: When to Use What

Factor       | YouTube API           | Web Scraping
------------ | --------------------- | -------------------------
Quota        | 10K units/day         | Unlimited (with proxies)
Data quality | Structured JSON       | Requires parsing
Setup        | API key required      | No auth needed
Cost         | Free (within quota)   | Proxy costs
Reliability  | High                  | Breaks with site changes
Best for     | Small-medium projects | Large-scale collection

Legal Considerations

YouTube's ToS restricts automated access. The YouTube API has its own terms of service. For web scraping:

  • Respect robots.txt
  • Don't overload servers
  • Use data responsibly
  • Check local laws (GDPR, CCPA apply to personal data in comments)
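
The first point is easy to automate with the standard library's urllib.robotparser. A small sketch (the rules string here is a simplified stand-in; fetch the site's real robots.txt in practice):

```python
from urllib.robotparser import RobotFileParser

def allowed_by(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules (pass the file's text in)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Illustrative rules only, not YouTube's actual robots.txt
rules = "User-agent: *\nDisallow: /results\n"
print(allowed_by(rules, 'my-scraper', 'https://www.youtube.com/results?search_query=x'))
print(allowed_by(rules, 'my-scraper', 'https://www.youtube.com/watch?v=abc'))
```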

Wrapping Up

The best approach depends on your scale:

  1. Under 10K quota units/day: Use the YouTube Data API
  2. Moderate scale: Use yt-dlp with rate limiting
  3. Production scale: Use a managed YouTube scraper with built-in proxy rotation
  4. Custom needs: Build your own scraper with residential proxies

Pick the method that matches your volume and reliability needs, and always be respectful of the platform's resources.
