Twitch remains the dominant live streaming platform in 2026, with over 7 million unique streamers going live each month. Whether you're building a streamer analytics dashboard, tracking gaming trends, or researching content creator ecosystems, getting structured data out of Twitch is a common need.
But here's the catch: Twitch's official API has gotten more restrictive over the years, rate limits have tightened, and some data points that used to be freely available now require special access. In this guide, I'll walk through the practical approaches to collecting Twitch data in 2026 — from the official API to web scraping alternatives.
The State of the Twitch API in 2026
Twitch's Helix API is still the primary official interface. To use it, you need to register an application on the Twitch Developer Console and obtain a Client ID and OAuth token.
Here's a quick setup to authenticate and fetch stream data:
import requests

CLIENT_ID = "your_client_id"
CLIENT_SECRET = "your_client_secret"

# Get an app access token via the client credentials flow
auth_response = requests.post(
    "https://id.twitch.tv/oauth2/token",
    params={
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "grant_type": "client_credentials",
    },
)
token = auth_response.json()["access_token"]

headers = {
    "Client-ID": CLIENT_ID,
    "Authorization": f"Bearer {token}",
}

# Fetch top live streams
response = requests.get(
    "https://api.twitch.tv/helix/streams",
    headers=headers,
    params={"first": 20},
)
streams = response.json()["data"]

for stream in streams:
    print(f"{stream['user_name']}: {stream['title']} ({stream['viewer_count']} viewers)")
This works fine for basic queries. But the API has real limitations:
- Rate limits: 800 requests per minute for most endpoints. Sounds generous until you're iterating over thousands of channels.
- No historical data: The API only shows what's live right now. Past streams, deleted clips, and historical viewer counts aren't available.
- No chat messages: The Helix API doesn't expose chat history. You'd need to connect via IRC or use EventSub websockets for real-time messages.
- Pagination caps: Some endpoints limit the total results you can paginate through, even if more data exists.
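Helix responses report your remaining quota in the `Ratelimit-Limit`, `Ratelimit-Remaining`, and `Ratelimit-Reset` headers, so you can throttle proactively instead of waiting for a 429. Here's a minimal sketch of that bookkeeping; `throttle_delay` is a hypothetical helper name, and the `now` parameter exists only so the logic can be exercised without real clock time:

```python
import time

def throttle_delay(headers, now=None):
    """Given Helix response headers, return seconds to sleep before the
    next request: 0 while budget remains, otherwise wait for the reset."""
    now = now if now is not None else time.time()
    remaining = int(headers.get("Ratelimit-Remaining", 1))
    reset = float(headers.get("Ratelimit-Reset", now))
    if remaining > 0:
        return 0.0
    return max(reset - now, 0.0)

# Simulated headers: no requests left, reset 12 seconds away
print(throttle_delay({"Ratelimit-Remaining": "0", "Ratelimit-Reset": "1700000012"}, now=1700000000))
```

Calling `throttle_delay(response.headers)` after each request and sleeping for the returned duration keeps a long-running collector under the limit without ever tripping it.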
Handling Pagination Properly
One of the most common mistakes when working with Twitch data is not handling pagination correctly. The API uses cursor-based pagination, and you need to follow the pagination.cursor field to get all results.
def fetch_all_clips(broadcaster_id, headers, max_pages=10):
    """Fetch clips with proper cursor-based pagination."""
    all_clips = []
    cursor = None
    for page in range(max_pages):
        params = {
            "broadcaster_id": broadcaster_id,
            "first": 100,  # max per request
        }
        if cursor:
            params["after"] = cursor
        response = requests.get(
            "https://api.twitch.tv/helix/clips",
            headers=headers,
            params=params,
        )
        data = response.json()
        clips = data.get("data", [])
        if not clips:
            break
        all_clips.extend(clips)
        print(f"Page {page + 1}: fetched {len(clips)} clips (total: {len(all_clips)})")
        cursor = data.get("pagination", {}).get("cursor")
        if not cursor:
            break
    return all_clips

# Usage
clips = fetch_all_clips("12345678", headers)
print(f"Total clips collected: {len(clips)}")
Note the first: 100 parameter — that's the maximum page size for most Twitch endpoints. If you don't set it, you'll get 20 results per page by default, meaning 5x more API calls for the same data.
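Since every paginated Helix endpoint uses the same `data` / `pagination.cursor` shape, the cursor-following loop can be factored into a reusable generator. In this sketch, `fetch_page` is an injected callable (an assumption for illustration, not a Twitch API) that takes a cursor and returns a parsed response body, which also makes the logic testable without network access:

```python
def paginate(fetch_page, max_pages=10):
    """Follow Helix-style cursor pagination. fetch_page(cursor) must return
    a parsed body like {"data": [...], "pagination": {"cursor": ...}}."""
    cursor = None
    for _ in range(max_pages):
        body = fetch_page(cursor)
        items = body.get("data", [])
        if not items:
            break
        yield from items
        cursor = body.get("pagination", {}).get("cursor")
        if not cursor:
            break

# Demo with a fake two-page endpoint
pages = {
    None: {"data": [1, 2], "pagination": {"cursor": "p2"}},
    "p2": {"data": [3], "pagination": {}},
}
print(list(paginate(lambda c: pages[c])))  # [1, 2, 3]
```

In real use, `fetch_page` would wrap `requests.get` with the endpoint URL, headers, and an `after` parameter when the cursor is set.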
Scraping Twitch Chat Data
Chat data is where things get interesting — and where the official API falls short. Twitch chat runs over IRC (yes, still), and you can connect to it programmatically:
import socket
import re
import time

def connect_to_chat(channel, oauth_token=None, nickname="justinfan67890"):
    """Connect to Twitch IRC chat. Use a justinfan* nickname for anonymous
    read-only access; in that case no PASS line is needed."""
    sock = socket.socket()
    sock.connect(("irc.chat.twitch.tv", 6667))
    if oauth_token:
        sock.send(f"PASS oauth:{oauth_token}\r\n".encode("utf-8"))
    sock.send(f"NICK {nickname}\r\n".encode("utf-8"))
    # Request additional metadata (badges, display names, etc.)
    sock.send("CAP REQ :twitch.tv/tags\r\n".encode("utf-8"))
    sock.send(f"JOIN #{channel}\r\n".encode("utf-8"))
    return sock

def parse_messages(sock, duration_seconds=60):
    """Read chat messages for a given duration."""
    messages = []
    start = time.time()
    while time.time() - start < duration_seconds:
        try:
            sock.settimeout(5.0)
            response = sock.recv(4096).decode("utf-8", errors="ignore")
            if response.startswith("PING"):
                sock.send("PONG :tmi.twitch.tv\r\n".encode("utf-8"))
                continue
            # Parse PRIVMSG lines
            for line in response.split("\r\n"):
                match = re.search(r":(.+?)!.+?PRIVMSG #(\w+) :(.+)", line)
                if match:
                    messages.append({
                        "user": match.group(1),
                        "channel": match.group(2),
                        "message": match.group(3),
                        "timestamp": time.time(),
                    })
        except socket.timeout:
            continue
    return messages

# Anonymous connection (read-only, no OAuth needed)
sock = connect_to_chat("xqc")
messages = parse_messages(sock, duration_seconds=30)
print(f"Captured {len(messages)} messages")
The justinfan trick is worth noting — Twitch allows anonymous read-only connections to any public chat channel using any nickname starting with justinfan. No authentication required. This is useful for data collection since you don't need to manage OAuth tokens for read-only access.
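Because the connection above requests the twitch.tv/tags capability, each PRIVMSG line arrives with an IRCv3 tag prefix (`@badges=...;color=...;display-name=... `) carrying badges, colors, and display names. A small parser can split that prefix off before the PRIVMSG regex runs. This sketch handles the common key=value form but skips full IRCv3 value unescaping (e.g. `\s` for spaces), and `parse_tags` is a name I'm introducing here:

```python
def parse_tags(raw_line):
    """Split the IRCv3 tag prefix off a raw chat line.
    Returns ({tag: value}, rest_of_line); no tags -> empty dict."""
    if not raw_line.startswith("@"):
        return {}, raw_line
    tag_part, _, rest = raw_line[1:].partition(" ")
    tags = {}
    for item in tag_part.split(";"):
        key, _, value = item.partition("=")
        tags[key] = value
    return tags, rest

line = ("@badges=subscriber/12;color=#FF4500;display-name=SomeViewer "
        ":someviewer!someviewer@someviewer.tmi.twitch.tv PRIVMSG #xqc :hi chat")
tags, rest = parse_tags(line)
print(tags["display-name"], tags["color"])  # SomeViewer #FF4500
```

Feeding `rest` (instead of the raw line) into the PRIVMSG regex keeps the two parsing steps cleanly separated.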
For production use, consider using EventSub websockets instead of raw IRC, as Twitch has been pushing developers toward that protocol.
Collecting Streamer Statistics and Channel Metadata
Beyond live streams and clips, you often want to build a profile of a streamer: follower count, stream schedule, most-played games, and channel description. Here's how to pull that together:
def get_channel_profile(username, headers):
    """Build a comprehensive channel profile from multiple API endpoints."""
    # Get user info
    user_resp = requests.get(
        "https://api.twitch.tv/helix/users",
        headers=headers,
        params={"login": username},
    )
    user = user_resp.json()["data"][0]
    user_id = user["id"]

    # Get channel info
    channel_resp = requests.get(
        "https://api.twitch.tv/helix/channels",
        headers=headers,
        params={"broadcaster_id": user_id},
    )
    channel = channel_resp.json()["data"][0]

    # Get follower count
    followers_resp = requests.get(
        "https://api.twitch.tv/helix/channels/followers",
        headers=headers,
        params={"broadcaster_id": user_id, "first": 1},
    )
    follower_count = followers_resp.json().get("total", 0)

    # Get recent stream schedule
    schedule_resp = requests.get(
        "https://api.twitch.tv/helix/schedule",
        headers=headers,
        params={"broadcaster_id": user_id},
    )

    return {
        "username": user["display_name"],
        "user_id": user_id,
        "description": user["description"],
        "profile_image": user["profile_image_url"],
        "created_at": user["created_at"],
        "game": channel["game_name"],
        "title": channel["title"],
        "language": channel["broadcaster_language"],
        "follower_count": follower_count,
        "schedule": schedule_resp.json().get("data", {}),
    }

profile = get_channel_profile("pokimane", headers)
for key, value in profile.items():
    print(f"{key}: {value}")
This requires multiple API calls per channel, which is where rate limits start to matter. If you need profiles for hundreds or thousands of channels, you'll want to batch requests and add proper rate limit handling.
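One easy batching win: the `/helix/users` endpoint accepts up to 100 `login` parameters in a single request, so resolving usernames to IDs should never be done one call at a time. The chunking logic is trivial to sketch (and to test offline); `chunked` is a helper name I'm introducing:

```python
def chunked(seq, size=100):
    """Split a sequence into fixed-size chunks; Helix /users accepts up to
    100 login parameters per request, so 100 is the natural batch size."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

logins = [f"channel{i}" for i in range(250)]
batches = chunked(logins)
print(len(batches))                        # 3 requests instead of 250
print(len(batches[0]), len(batches[-1]))   # 100 50
```

For each batch, pass `params={"login": batch}` to `requests.get` against `/helix/users`: requests encodes a list value as repeated `login=a&login=b&...` query parameters, which is exactly what the endpoint expects.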
When Web Scraping Makes More Sense
There are legitimate scenarios where scraping Twitch's web interface is the better option:
- VOD metadata: Details about past broadcasts that the API doesn't expose well
- Community data: Sub counts, emote lists, and community points that have limited API support
- Discovery data: Browse page rankings, recommended channels, category trending data
- Historical snapshots: Periodic captures of what the platform looks like at a point in time
For basic web scraping, you can use BeautifulSoup or Playwright:
from playwright.sync_api import sync_playwright

def scrape_category_streams(category_slug):
    """Scrape live streams from a Twitch category page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(f"https://www.twitch.tv/directory/game/{category_slug}")
        page.wait_for_selector('[data-target="directory-first-item"]', timeout=10000)

        # Scroll to load more results
        for _ in range(3):
            page.evaluate("window.scrollBy(0, 1000)")
            page.wait_for_timeout(1500)

        # Extract stream cards
        streams = page.evaluate("""
            () => {
                const cards = document.querySelectorAll('article');
                return Array.from(cards).map(card => ({
                    title: card.querySelector('h3')?.textContent || '',
                    streamer: card.querySelector('[data-a-target="preview-card-channel-link"]')?.textContent || '',
                    viewers: card.querySelector('.tw-media-card-stat')?.textContent || ''
                }));
            }
        """)
        browser.close()
        return streams

streams = scrape_category_streams("League-of-Legends")
for s in streams:
    print(f"{s['streamer']}: {s['title']} - {s['viewers']}")
Note that Twitch uses React with heavy client-side rendering, so simple HTTP requests won't work — you need a headless browser.
The Practical Solution: Using Pre-built Scrapers
Building and maintaining a Twitch scraper is doable, but dealing with anti-bot measures, layout changes, and session management is ongoing work. If you need Twitch data regularly, it's worth considering pre-built tools.
For example, the Twitch Scraper on Apify handles all of this — authentication, pagination, headless browser management, and structured output. You feed it channel names or categories and get back clean JSON with stream data, clips, and channel metadata. It runs in the cloud, so you don't need to manage infrastructure.
The advantage of this approach is that someone else maintains the scraper when Twitch changes their frontend. The tradeoff is cost — though for most use cases, the free tier is enough to get started.
Respecting Twitch's Terms and Rate Limits
A few important notes on the legal and ethical side:
- Twitch's API terms require you to display attribution when showing their data publicly
- Rate limits exist for a reason — respect them. Implement exponential backoff:
import time

def request_with_backoff(url, headers, params, max_retries=5):
    """Make API request with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, params=params)
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 429:
            # Rate limited — check Ratelimit-Reset header
            reset_time = int(response.headers.get("Ratelimit-Reset", time.time() + 60))
            wait = max(reset_time - time.time(), 2 ** attempt)
            print(f"Rate limited. Waiting {wait:.0f}s...")
            time.sleep(wait)
        else:
            response.raise_for_status()
    raise Exception("Max retries exceeded")
- Don't scrape personal data (real names, emails) without a legitimate basis
- Cache aggressively — if the data doesn't change every minute, don't fetch it every minute
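The caching advice above can be as simple as a dictionary with timestamps. Here's a minimal in-memory TTL cache sketch (the `TTLCache` name and the injectable `now` parameter, included so expiry can be tested without sleeping, are my own choices; for production you'd likely reach for Redis or `functools.lru_cache` variants):

```python
import time

class TTLCache:
    """Tiny in-memory cache: entries expire after ttl seconds."""

    def __init__(self, ttl=300):
        self.ttl = ttl
        self._store = {}  # key -> (stored_at, value)

    def get(self, key, now=None):
        now = now if now is not None else time.time()
        hit = self._store.get(key)
        if hit and now - hit[0] < self.ttl:
            return hit[1]
        return None  # missing or expired

    def set(self, key, value, now=None):
        self._store[key] = (now if now is not None else time.time(), value)

cache = TTLCache(ttl=60)
cache.set("streams:top", [{"user_name": "example"}], now=0)
print(cache.get("streams:top", now=30) is not None)  # True: still fresh
print(cache.get("streams:top", now=90))              # None: expired
```

Wrapping API calls in a `cache.get` check means a dashboard refreshing every 10 seconds only hits Twitch once per TTL window.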
Conclusion
Twitch data collection in 2026 is a mix of official API usage and targeted scraping. The Helix API covers most basic needs — live streams, clips, user profiles — but falls short on historical data, chat logs, and discovery metrics.
For most projects, I'd recommend starting with the official API for structured data, adding IRC or EventSub connections for real-time chat, and using web scraping (or a managed scraper like the Twitch Scraper on Apify) for everything else.
The key is to pick the right tool for each data point rather than trying to force one approach for everything. And whatever approach you use, build in rate limiting and caching from day one — your future self will thank you.
What Twitch data are you working with? Drop a comment below — I'm always curious to hear about creative uses of streaming platform data.