DEV Community

Vhub Systems

How to Scrape YouTube Comments Without the API (Reverse Engineering InnerTube)

YouTube's official Data API v3 gives you 10,000 units per day. A single commentThreads.list request costs 1 unit, so you get at most 10,000 comment pages per day. Each page returns up to 100 top-level comments, and every other call your pipeline makes (video metadata, search — search.list alone costs 100 units) draws from the same quota pool, so for analysis at any real scale you exhaust it quickly.

There's a better way: YouTube's internal InnerTube API, which is what the YouTube website itself uses. No quota limits, and no Google Cloud API key or OAuth — the only key involved is a public one embedded in every YouTube page.

What is InnerTube?

InnerTube is YouTube's internal JSON API. Every request your browser makes when loading YouTube — video metadata, comments, search results — goes through InnerTube endpoints at https://www.youtube.com/youtubei/v1/.

These endpoints are technically public (your browser hits them every time you watch a video), but they're undocumented and can change without notice.

Getting the required context

InnerTube requests need a context object that mimics a real browser client. You can capture this by opening YouTube in Chrome DevTools → Network → filter for youtubei and inspect any request.

The static values that work as of April 2026:

INNERTUBE_CONTEXT = {
    "client": {
        "clientName": "WEB",
        "clientVersion": "2.20240101.01.00",
        "hl": "en",
        "gl": "US",
        "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
}

INNERTUBE_API_KEY = "AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8"
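Hardcoding these values works until YouTube bumps its client version. A more robust option is to pull them from any watch page at startup, since YouTube embeds both the API key and the client version in the page source. A sketch — the function names and regexes here are mine, and assume the current page markup:

```python
import re

import requests

def parse_innertube_config(html: str) -> dict:
    """Pull the InnerTube API key and client version out of watch-page HTML."""
    key = re.search(r'"INNERTUBE_API_KEY":"([^"]+)"', html)
    version = re.search(r'"INNERTUBE_CONTEXT_CLIENT_VERSION":"([^"]+)"', html)
    if not (key and version):
        raise ValueError("Could not locate InnerTube config in page HTML")
    return {"api_key": key.group(1), "client_version": version.group(1)}

def fetch_innertube_config(video_id: str) -> dict:
    """Fetch a watch page and extract its InnerTube config."""
    html = requests.get(
        f"https://www.youtube.com/watch?v={video_id}",
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=15,
    ).text
    return parse_innertube_config(html)
```

Run this once per session and splice the extracted clientVersion into INNERTUBE_CONTEXT, so a version bump on YouTube's side doesn't silently break your requests.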

Fetching comments for a video

import requests
import json

def get_youtube_comments(video_id: str, max_comments: int = 1000) -> list:
    """Fetch YouTube comments using InnerTube API"""

    url = "https://www.youtube.com/youtubei/v1/next"
    params = {"key": INNERTUBE_API_KEY}

    # Initial request
    payload = {
        "context": INNERTUBE_CONTEXT,
        "videoId": video_id,
        "params": "Eg0SCDMjYEiDAhAC"  # Comment sort = top comments
    }

    headers = {
        "Content-Type": "application/json",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }

    response = requests.post(url, json=payload, params=params, headers=headers, timeout=15)
    response.raise_for_status()
    data = response.json()

    comments = []

    # Parse the initial response, then grab the first continuation token
    comments.extend(_extract_comments(data))
    continuation_token = _get_continuation_token(data)

    # Paginate
    while continuation_token and len(comments) < max_comments:
        payload = {
            "context": INNERTUBE_CONTEXT,
            "continuation": continuation_token
        }

        response = requests.post(
            url,
            json=payload,
            params=params,
            headers=headers,
            timeout=15
        )
        response.raise_for_status()
        data = response.json()

        new_comments = _extract_comments(data)
        if not new_comments:
            break

        comments.extend(new_comments)
        continuation_token = _get_continuation_token(data)

    return comments[:max_comments]

Parsing the response

InnerTube responses are deeply nested. The comment data lives in a renderer path:

def _extract_comments(data: dict) -> list:
    """Extract comment objects from InnerTube response"""
    comments = []

    # Navigate the response tree
    tabs = data.get("engagementPanels", [])
    for panel in tabs:
        try:
            # Comments are in the engagementPanelSectionListRenderer
            items = (panel
                .get("engagementPanelSectionListRenderer", {})
                .get("content", {})
                .get("sectionListRenderer", {})
                .get("contents", []))

            for item in items:
                comment_thread = (item
                    .get("itemSectionRenderer", {})
                    .get("contents", []))

                for c in comment_thread:
                    renderer = c.get("commentThreadRenderer", {})
                    comment = renderer.get("comment", {}).get("commentRenderer", {})

                    if comment:
                        text_runs = comment.get("contentText", {}).get("runs", [])
                        text = "".join(r.get("text", "") for r in text_runs)

                        comments.append({
                            "id": comment.get("commentId"),
                            "author": comment.get("authorText", {}).get("simpleText", ""),
                            "text": text,
                            "likes": _parse_count(comment.get("voteCount", {}).get("simpleText", "0")),
                            "published": comment.get("publishedTimeText", {}).get("simpleText", ""),
                            "is_reply": False
                        })
        except (KeyError, TypeError):
            continue

    return comments

def _get_continuation_token(data: dict) -> str | None:
    """Extract continuation token for pagination"""
    try:
        for panel in data.get("engagementPanels", []):
            items = (panel
                .get("engagementPanelSectionListRenderer", {})
                .get("content", {})
                .get("sectionListRenderer", {})
                .get("continuations", []))

            for cont in items:
                token = (cont
                    .get("nextContinuationData", {})
                    .get("continuation"))
                if token:
                    return token
    except (KeyError, TypeError):
        pass
    return None

def _parse_count(text: str) -> int:
    """Parse YouTube count strings like '1.2K', '4.5M'"""
    text = text.strip().replace(",", "")
    if text.endswith("K"):
        return int(float(text[:-1]) * 1000)
    elif text.endswith("M"):
        return int(float(text[:-1]) * 1_000_000)
    try:
        return int(text)
    except ValueError:
        return 0
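Those long `.get(...).get(...)` chains can be factored into a small path-walking helper, which keeps the renderer paths readable and easy to update when YouTube shuffles the structure. A sketch — `dig` is my name, not an InnerTube concept:

```python
def dig(obj, *path, default=None):
    """Walk nested dicts/lists by keys and list indexes; return default on any miss."""
    for step in path:
        if isinstance(obj, dict):
            obj = obj.get(step)
        elif isinstance(obj, list) and isinstance(step, int) and -len(obj) <= step < len(obj):
            obj = obj[step]
        else:
            return default
        if obj is None:
            return default
    return obj

# A trimmed response shape, mirroring the continuation path above
sample = {"engagementPanels": [{"engagementPanelSectionListRenderer": {
    "content": {"sectionListRenderer": {"continuations": [
        {"nextContinuationData": {"continuation": "TOKEN123"}}]}}}}]}

token = dig(sample, "engagementPanels", 0,
            "engagementPanelSectionListRenderer", "content",
            "sectionListRenderer", "continuations", 0,
            "nextContinuationData", "continuation")
# token == "TOKEN123"; any missing step yields None instead of raising
```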

Usage

# Get top 500 comments for a video
comments = get_youtube_comments("dQw4w9WgXcQ", max_comments=500)

for comment in comments[:5]:
    print(f"{comment['author']}: {comment['text'][:100]}")
    print(f"  Likes: {comment['likes']} | Posted: {comment['published']}")
    print()

Alternative: Use the Apify scraper

If you don't want to maintain the InnerTube parsing logic (which breaks when YouTube updates its response format), there's a pre-built actor that handles this:

The YouTube Comment Scraper on Apify handles the InnerTube parsing, rate limiting, and rotation automatically. Input a video URL or list, get structured JSON output.

Handling rate limits

InnerTube doesn't have explicit rate limits but will start returning empty responses if you hammer it. Practical limits from testing:

  • ~100 requests/minute per IP before throttling
  • ~500 comment pages per session before needing a fresh session

For high-volume extraction, rotate residential IPs and add 1-2 second delays between requests.
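The delay-and-retry part can live in a small wrapper around each POST. A sketch under the assumptions above (the `polite_post` name and the empty-body-means-throttled heuristic are mine):

```python
import random
import time

def polite_post(session, url, payload, params, max_retries=3):
    """POST with a randomized 1-2 s delay and simple exponential backoff.

    An OK status with an empty body is treated as soft throttling and retried;
    after max_retries failed attempts we give up loudly.
    """
    for attempt in range(max_retries):
        time.sleep(random.uniform(1.0, 2.0))  # jittered delay between requests
        response = session.post(url, json=payload, params=params, timeout=15)
        if response.ok and response.content:
            return response
        time.sleep(2 ** attempt)  # back off: 1 s, 2 s, 4 s, ...
    raise RuntimeError(f"Gave up after {max_retries} attempts: {url}")
```

Pass in a `requests.Session()` so connection reuse and cookies carry across the pagination loop; swapping the session object out is also the natural point to rotate proxies.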

What this gives you

The InnerTube approach returns the same data as the official API (comment text, author, likes, published date, reply counts) with no daily quota cap. For most analysis tasks — sentiment analysis, spam detection, competitor research, audience research — this is everything you need.

The tradeoff: the response structure changes periodically without notice. Budget 1-2 hours per year maintaining the parser.


Skip the maintenance overhead

If you'd rather not deal with parser maintenance, the Apify Scrapers Bundle ($29) includes a pre-built YouTube comment scraper that handles all of this automatically.

One-time purchase. Documented. Production-ready.
