Devil Scrapes

Posted on Jun 5

Training a Twitch chat toxicity classifier on real VOD data at scale

#machinelearning #python #webscraping #datascience

Quick answer: Twitch has no public API for VOD chat replay. To build a Twitch toxicity classifier dataset you walk the internal VideoCommentsByOffsetOrCursor GraphQL endpoint at scale — the same one the web player uses. The Devil Scrapes Twitch VOD Chat Archive Actor does that for $0.001 per message (~$1.05 per 1,000), returning the structured fields — message_fragments, badges, is_subscriber — that make classifier features actually useful.

If you maintain a mod-bot (StreamElements, Nightbot, Streamlabs, or custom), or if you are an ML engineer building a Twitch-native toxicity model, your training data problem is the same: you need labelable chat messages at scale from real VODs, with enough context per row to build signal-rich features. This post walks the full pipeline: pulling the data, loading it into pandas, training a baseline TF-IDF + logistic-regression classifier, and sketching the upgrade path to a transformer.

Does Twitch have an API for chat training data? 🔎

Not in any useful sense. The Twitch Helix API exposes live IRC chat via EventSub and the Chat & Messaging endpoints, but it has no endpoint for VOD chat replay — the historical timestamped record of a past broadcast. That data exists (you can watch it in the VOD player), but the only programmatic surface for it is the internal VideoCommentsByOffsetOrCursor persisted GraphQL query.

Walking that endpoint reliably is a job in itself. Twitch inspects TLS fingerprints from incoming requests — Python's requests or httpx produce a ClientHello that no real browser sends, and the server responds with a 403 before it reads the body. Past roughly 10,000 messages on a single IP, Twitch's rate-limiting kicks in hard. The cursor-based pagination mode triggers an integrity-check challenge that needs a live browser to solve. Offset-based pagination avoids it, but only if you know to use it before you start coding.

We absorb all of that. The Actor rotates through Chrome, Firefox, and Safari TLS fingerprints via curl-cffi, threads residential proxies with fresh session IDs on each block, retries with exponential backoff on 408 / 429 / 5xx, and pages exclusively by content offset to sidestep the integrity check. The result is a clean dataset of typed rows you can load straight into pandas.
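To make the retry behaviour concrete, here is a minimal sketch of exponential backoff on 408 / 429 / 5xx. It is not the Actor's actual code: `fetch` is a hypothetical zero-argument callable returning `(status, body)`, and the attempt count, base delay, and jitter range are illustrative assumptions.

```python
import random
import time

RETRYABLE_STATUSES = {408, 429}


def is_retryable(status):
    """408, 429, and any 5xx are worth retrying; other codes are not."""
    return status in RETRYABLE_STATUSES or 500 <= status <= 599


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status or attempts run out.

    fetch is a hypothetical callable returning (status, body).
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if not is_retryable(status):
            return status, body
        if attempt < max_attempts - 1:
            # Delay doubles each attempt (base, 2*base, 4*base, ...) plus jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    # Attempts exhausted: hand back the last retryable response.
    return status, body
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A 403 is deliberately not retried here, since a fingerprint block would need a different client profile rather than more attempts.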

Why these fields matter for classifier training 🧪

Not all chat APIs return the same structure. The fields the Actor returns were chosen with feature engineering in mind:

message_text — the plain-text body of the message with emote shortcodes preserved as literal text (e.g. "PogChamp PogChamp OMEGALUL"). This is your label target and your primary text feature.

message_fragments — a structured array of {type, text, emote_id} objects. Type is either "text" or "emote". This matters because emotes carry semantic weight a TF-IDF tokenizer cannot capture from their shortcode text alone. An "emote" fragment with emote_id lets you treat emotes as a distinct token type, deduplicate their representation, or embed them separately. Spam runs often consist almost entirely of emote fragments; that ratio is a cheap feature.

badges — an array of {set_id, version} objects representing the user's active chat badges. A user carrying a moderator badge, a broadcaster badge, or a vip badge is structurally different from a first-time chatter — and their messages should be weighted differently in your training set. A model that does not distinguish a moderator warning from a random user saying the same thing is a weaker model.

is_subscriber — a boolean convenience flag derived from the badges array. Subscribers are users who have paid for channel membership; their base rate of toxic behavior differs from non-subscribers. This is a fast binary feature your model can use without parsing the full badges array.

message_offset_seconds — the message's position in the VOD timeline in seconds. Toxic spikes correlate with in-stream events: a bad play, a controversial opinion, a raid. Including offset in your labeling pass lets you sample across the full timeline rather than front-loading training data from the first ten minutes.

commenter_id and commenter_login — stable user identifiers. These let you group messages per user for user-level features (message frequency, per-user historical toxic rate) or for deduplicating known-spam accounts from training positives.
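As a rough illustration of treating emotes as a distinct token type, the sketch below renders message_fragments into a single string where each emote becomes its own token. The `emote_<id>` token format is an assumption made here for illustration, not something the Actor emits; it sticks to word characters so that a word-based tokenizer keeps it in one piece.

```python
def fragments_to_tokens(fragments):
    """Render a message as a token string with emotes marked explicitly.

    The "emote_<id>" format is an illustrative choice, not an Actor field.
    """
    tokens = []
    for frag in fragments or []:
        if frag.get("type") == "emote" and frag.get("emote_id"):
            # One token per emote_id, regardless of the shortcode text.
            tokens.append(f"emote_{frag['emote_id']}")
        else:
            tokens.extend((frag.get("text") or "").split())
    return " ".join(tokens)
```

Feeding this string to a vectorizer instead of message_text means two emotes that share a shortcode but have different IDs stay distinct, and a plain word that happens to match a shortcode is not counted as an emote.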

Step 1 — Pull the training data

You need apify-client installed (pip install apify-client pandas scikit-learn). Get a free Apify API token at apify.com — no card required, every account starts with $5 of credit.

The call below targets three VODs by ID and caps at 5,000 messages per VOD. At $0.001 per message plus the $0.05 actor-start, 15,000 messages costs $15.05.

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/twitch-vod-chat-archive").call(
    run_input={
        "vodIds": [
            "2773625679",
            "2756421083",
            "2741897234"
        ],
        "maxMessagesPerVod": 5000,
        "startOffsetSeconds": 0,
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"]
        }
    }
)

items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(f"Pulled {len(items)} messages")

For a larger training corpus (say 100 VODs from a mix of channels), use channelLogin mode with maxRecentVods instead of listing IDs, and run it once per channel. The call below pulls up to 50 recent VODs from a single channel:

run = client.actor("DevilScrapes/twitch-vod-chat-archive").call(
    run_input={
        "channelLogin": "shroud",
        "maxRecentVods": 50,
        "maxMessagesPerVod": 10000,
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"]
        }
    }
)

That gives you up to 500,000 messages per channel in a single run. At $0.001/message that is ~$500.05 for the full 500k — but the free $5 trial credit covers 4,950 messages, enough to validate your pipeline before committing.
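The cost arithmetic is easy to script before you launch a run. This small helper assumes only the two prices quoted in this post ($0.001 per message and a $0.05 actor-start fee per run); the function names are made up for this example.

```python
from decimal import Decimal

PRICE_PER_MESSAGE = Decimal("0.001")  # per-message price quoted in this post
ACTOR_START_FEE = Decimal("0.05")     # charged once per run


def run_cost(messages):
    """Cost in dollars of a single run that returns `messages` rows."""
    return ACTOR_START_FEE + PRICE_PER_MESSAGE * messages


def messages_within_budget(budget):
    """Largest message count a single run can return for `budget` dollars."""
    remaining = Decimal(str(budget)) - ACTOR_START_FEE
    if remaining <= 0:
        return 0
    return int(remaining / PRICE_PER_MESSAGE)
```

Decimal is used instead of float so the totals match the figures in the text exactly: `run_cost(15000)` is 15.05, and `messages_within_budget(5)` is 4950.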

Step 2 — Load into pandas and build features

import pandas as pd

df = pd.DataFrame(items)

# Compute emote ratio — useful spam feature
def emote_ratio(fragments):
    if not fragments:
        return 0.0
    emote_count = sum(1 for f in fragments if f.get("type") == "emote")
    return emote_count / len(fragments)

df["emote_ratio"] = df["message_fragments"].apply(emote_ratio)

# Extract badge sets as a frozenset for grouping
def badge_set(badges):
    return frozenset(b["set_id"] for b in badges) if badges else frozenset()

df["badge_set"] = df["badges"].apply(badge_set)

# is_moderator / is_broadcaster convenience columns
df["is_moderator"] = df["badge_set"].apply(lambda s: "moderator" in s)
df["is_broadcaster"] = df["badge_set"].apply(lambda s: "broadcaster" in s)

# Messages per user — frequency signal
msg_counts = df.groupby("commenter_id")["message_id"].count().rename("user_msg_count")
df = df.merge(msg_counts, on="commenter_id", how="left")

print(df[["message_text", "is_subscriber", "is_moderator", "emote_ratio", "user_msg_count"]].head())

Sample output row from a real VOD scrape (channel: shroud, toxic content masked):

{
  "vod_id": "2773625679",
  "vod_title": "never played forza but i definitely have a drivers license so it should be easy",
  "channel_login": "shroud",
  "message_id": "1292e052-0561-4db5-86c7-adfc4556d628",
  "message_offset_seconds": 12,
  "posted_at": "2026-05-16T18:42:35.297Z",
  "commenter_id": "142680597",
  "commenter_login": "tabrexs",
  "commenter_display_name": "tabrexs",
  "message_text": "PewPewPew",
  "message_fragments": [
    {
      "type": "emote",
      "text": "PewPewPew",
      "emote_id": "emotesv2_587405136a8147148c77df74baaa1bf4"
    }
  ],
  "user_color": "#DAA520",
  "badges": [],
  "is_subscriber": false,
  "scraped_at": "2026-05-16T19:00:00Z"
}

Step 3 — Label and train a baseline classifier

For a first iteration, label toxic/benign manually on a sample and train a TF-IDF + logistic-regression baseline. This is fast to iterate on and gives you a performance floor to beat with transformer fine-tuning later.
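One way to choose that sample without front-loading the first minutes of a stream is to bucket messages by message_offset_seconds and draw a fixed number from each bucket. The sketch below works on the raw `items` list; the bucket width and per-bucket quota are arbitrary assumptions to adjust for your labeling budget.

```python
import random
from collections import defaultdict


def sample_across_timeline(rows, bucket_seconds=600, per_bucket=50, seed=42):
    """Draw up to `per_bucket` rows from every time bucket of every VOD."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for row in rows:
        key = (row["vod_id"], row["message_offset_seconds"] // bucket_seconds)
        buckets[key].append(row)

    sample = []
    for key in sorted(buckets):
        bucket = buckets[key]
        # Small buckets are taken whole rather than skipped.
        sample.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
    return sample
```

A fixed seed keeps the labeling queue reproducible, so a second annotator can be handed the same rows.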

Important framing note for the labeling pass: toxic labels in mod-tool training are typically defined by the channel's own moderation rules, not a universal taxonomy. What a family-friendly channel flags as toxic differs from a gaming-focused one. Build your label schema per-channel or use a community standard like Perspective API categories for initial seeding.

Do not store known-slur text in plaintext in your labeled examples file. Keep it masked (e.g. [masked slur]) and apply transformations at load time. The mod community, and any team reviewing your training data, will thank you.
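A minimal sketch of the load-time transformation, under one assumption that goes beyond the text above: each placeholder carries a key, written here as `[masked:<key>]`, and the real terms live in a separate, access-restricted lexicon that maps keys to strings. Both the placeholder format and the lexicon are hypothetical.

```python
import re

# Hypothetical placeholder format: [masked:<key>]
MASK_PATTERN = re.compile(r"\[masked:(\w+)\]")


def unmask(text, lexicon):
    """Replace [masked:<key>] placeholders using a separately stored lexicon.

    Keys missing from the lexicon are left masked rather than raising.
    """
    return MASK_PATTERN.sub(lambda m: lexicon.get(m.group(1), m.group(0)), text)
```

Applied just before vectorizing, this keeps the file that annotators and reviewers open free of plaintext terms while the model still trains on the real strings.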

import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
import numpy as np

# Load your labeled subset (human annotations: {message_id: 0 or 1})
# 0 = benign, 1 = toxic / spam
with open("labels.json") as f:
    labels = json.load(f)  # {"message_id_1": 0, "message_id_2": 1, ...}

labeled_df = df[df["message_id"].isin(labels)].copy()
labeled_df["label"] = labeled_df["message_id"].map(labels)

# Text feature — message_text is the primary signal
X_text = labeled_df["message_text"].fillna("")
y = labeled_df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: TF-IDF unigrams + bigrams, logistic regression
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=20000,
        sublinear_tf=True
    )),
    ("clf", LogisticRegression(
        C=1.0,
        class_weight="balanced",  # important: toxic is a minority class
        max_iter=1000
    )),
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred, target_names=["benign", "toxic"]))

Adding structural features alongside TF-IDF:

The text pipeline above ignores emote_ratio, is_subscriber, and user_msg_count. To include them in the same model, combine sparse TF-IDF with a dense feature matrix:

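from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Dense features, cast to float so booleans and counts share one numeric dtype
dense_cols = ["emote_ratio", "is_subscriber", "is_moderator", "user_msg_count"]
dense_df = labeled_df[dense_cols].fillna(0).astype(float)

# Select rows by index label so they stay aligned with the shuffled
# X_train / X_test order produced by train_test_split
X_train_dense = dense_df.loc[X_train.index].to_numpy()
X_test_dense = dense_df.loc[X_test.index].to_numpy()

# Scale using train-split statistics only, so user_msg_count does not dominate
scaler = StandardScaler()
X_train_dense = scaler.fit_transform(X_train_dense)
X_test_dense = scaler.transform(X_test_dense)

# Fit TF-IDF on train split only
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=20000, sublinear_tf=True)
X_train_sparse = tfidf.fit_transform(X_train)
X_test_sparse = tfidf.transform(X_test)

# Combine sparse text features with the dense columns
X_train_combined = hstack([X_train_sparse, csr_matrix(X_train_dense)]).tocsr()
X_test_combined = hstack([X_test_sparse, csr_matrix(X_test_dense)]).tocsr()

clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
clf.fit(X_train_combined, y_train)

print(classification_report(y_test, clf.predict(X_test_combined), target_names=["benign", "toxic"]))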

In practice the emote_ratio column tends to lift spam precision noticeably — pure-emote spam messages produce a ratio near 1.0 and a short message_text length, a combination TF-IDF alone does not capture well.

Step 4 — Upgrade path to a transformer 🤗

The baseline above will plateau around 75–82% F1 on a well-balanced Twitch dataset. The main failure modes are:

Context-dependent toxicity (a word that is benign in one sentence, toxic in another)
Emote-heavy messages where the semantic meaning is entirely in the emote, not the text
Cross-channel domain shift (a model trained on one streamer's community generalizes poorly)

The upgrade path is to fine-tune a pre-trained model on your labeled data. cardiffnlp/twitter-roberta-base-offensive is a strong starting checkpoint for chat-style text — it was trained on social-media toxicity and transfers better to Twitch than a generic BERT.

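# Minimal fine-tuning sketch; epochs and batch size are placeholders to tune for your GPU
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

model_name = "cardiffnlp/twitter-roberta-base-offensive"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

text_df = labeled_df[["message_text", "label"]].rename(columns={"message_text": "text"}).fillna({"text": ""})
hf_dataset = Dataset.from_pandas(text_df, preserve_index=False)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

splits = hf_dataset.map(tokenize, batched=True).train_test_split(test_size=0.2, seed=42)

def compute_metrics(eval_pred):
    logits, y_true = eval_pred
    return {"f1": f1_score(y_true, np.argmax(logits, axis=-1))}

args = TrainingArguments(output_dir="twitch-toxicity", num_train_epochs=3, per_device_train_batch_size=32)
trainer = Trainer(model=model, args=args, train_dataset=splits["train"], eval_dataset=splits["test"], compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())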

The message_fragments field opens a further avenue: treat emote tokens as special tokens added to the tokenizer vocabulary (one token per emote_id), then let the model learn emote embeddings jointly with text. This is not a weekend project, but it is the difference between a model that handles OMEGALUL as an unknown token and one that learns it signals laughter.
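A first step in that direction is deciding which emotes earn a token at all. The sketch below counts emote_id occurrences across the raw rows and keeps the ones above a frequency cutoff. The `<emote_ID>` token spelling and the `min_count` default are assumptions made for this example.

```python
from collections import Counter


def build_emote_vocab(rows, min_count=5):
    """Return one special-token string per emote_id seen at least min_count times."""
    counts = Counter(
        frag["emote_id"]
        for row in rows
        for frag in row.get("message_fragments") or []
        if frag.get("type") == "emote" and frag.get("emote_id")
    )
    # Most frequent first; rare emotes are dropped to keep the vocabulary small.
    return [f"<emote_{eid}>" for eid, n in counts.most_common() if n >= min_count]
```

With Hugging Face transformers, the resulting list could then be registered via `tokenizer.add_tokens(vocab)` followed by `model.resize_token_embeddings(len(tokenizer))`, with each emote fragment rewritten to its token before tokenizing. The new embeddings start untrained, so how well this works will depend on how many labeled examples contain each emote.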

How much data do you actually need? 💰

Pricing answers part of this directly. At $0.001/message plus the $0.05 actor-start fee per run:

Pull size	Cost	Labeled examples (assuming 10% manual label rate)
10,000 messages	$10.05	~1,000 labeled rows
50,000 messages	$50.05	~5,000 labeled rows
100,000 messages	$100.05	~10,000 labeled rows

For a TF-IDF baseline, 1,000–5,000 labeled examples is workable if your class balance is reasonable. For transformer fine-tuning, 5,000+ labeled examples per class is the typical floor for stable results. The free trial credit covers 4,950 messages before you spend a cent, which is enough to validate your feature extraction pipeline end-to-end before scaling up.

The full Twitch chat scraper guide covers the broader use-case landscape (esports analytics, post-broadcast review, channel back-catalog mode) if you want context beyond classifier training: Twitch Chat Scraper: export any VOD's full chat replay for $1.05/1K.

FAQ ❓

Can I use this for StreamElements / Nightbot rule testing?

Yes. Pull historical chat from VODs where you know toxic events occurred, then replay the message_text values through your bot's filter rules in a test harness. The badges and is_subscriber fields let you simulate the trust-level rules most bots implement (moderators and subscribers often get different thresholds).
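A test harness for that can be very small. In the sketch below, `rule` stands in for whatever filter your bot applies (a hypothetical callable taking the message text and returning True when it would act), and the three trust levels are an illustrative simplification, not any particular bot's scheme.

```python
def trust_level(row):
    """Map badges / is_subscriber to a coarse trust level (illustrative scheme)."""
    badge_ids = {b["set_id"] for b in row.get("badges") or []}
    if badge_ids & {"broadcaster", "moderator"}:
        return "mod"
    if "vip" in badge_ids or row.get("is_subscriber"):
        return "trusted"
    return "default"


def replay(rows, rule, exempt_levels=("mod",)):
    """Return the message_ids that `rule` would flag, skipping exempt users."""
    flagged = []
    for row in rows:
        if trust_level(row) in exempt_levels:
            continue
        if rule(row.get("message_text") or ""):
            flagged.append(row["message_id"])
    return flagged
```

Running the same rows through two versions of a rule and comparing the flagged IDs shows what a change would have caught or missed on a real stream.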

Does the Actor return deleted or banned messages?

No. The public chat-replay endpoint does not expose moderator actions — bans, timeouts, or the content of deleted messages. Deleted messages may appear as a <message deleted> placeholder or may not appear at all, depending on when they were removed relative to the archive write. Your toxicity model should treat the absence of a message ID from a later snapshot as a soft toxic signal, not a hard one.
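If you do pull the same VOD twice, the comparison is a set difference on message_id. This sketch assumes you kept both pulls as lists of rows; the function name is invented for the example.

```python
def missing_from_later_snapshot(earlier_rows, later_rows):
    """message_ids present in the earlier pull but absent from the later one."""
    later_ids = {row["message_id"] for row in later_rows}
    return [
        row["message_id"]
        for row in earlier_rows
        if row["message_id"] not in later_ids
    ]
```

Treat the result as a candidate list for manual review. A message can disappear for reasons other than moderation, which is why the signal is soft.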

How do I avoid training on bot messages?

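Filter on per-user message counts: accounts that sent more than N messages in the same VOD are candidate spam bots. Note that user_msg_count as computed in Step 2 counts across every VOD in the pull, so group by both vod_id and commenter_id if you want a per-VOD figure. You can also filter out users whose message_text is identical across multiple rows in the same VOD (copy-paste spam). The Actor returns the stable commenter_id so grouping is straightforward.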
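Both heuristics can be applied to the raw rows in one pass. The thresholds below are placeholders rather than recommended values, and the function returns candidates to inspect, since very active human chatters will also trip a volume cutoff.

```python
from collections import Counter


def candidate_bots(rows, max_messages=200, max_repeats=10):
    """Return commenter_ids that look automated within a single VOD.

    A user is flagged for sending more than max_messages in one VOD, or for
    repeating one identical message_text more than max_repeats times.
    """
    per_user = Counter()
    per_text = Counter()
    for row in rows:
        user = (row["vod_id"], row["commenter_id"])
        per_user[user] += 1
        per_text[user + (row.get("message_text") or "",)] += 1

    heavy = {user for user, n in per_user.items() if n > max_messages}
    repeaters = {key[:2] for key, n in per_text.items() if n > max_repeats}
    return {commenter_id for _, commenter_id in heavy | repeaters}
```

Dropping the returned IDs before labeling keeps copy-paste spam from crowding out rarer toxic examples. Keeping a sample of them as explicit spam positives is a reasonable alternative.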

Is this legal / TOS-compliant?

Twitch's public VOD chat replay is presented to any logged-out visitor; this Actor retrieves only what the VOD player shows anonymously, at a paced rate. We are not affiliated with Twitch. Check your own jurisdiction and use case. The Twitch Terms of Service govern what you may do with the collected data, notably the prohibition on commercial use of data in ways that compete directly with Twitch.

The Actor is live at apify.com/DevilScrapes/twitch-vod-chat-archive. Free $5 trial credit, no credit card. Pull a few thousand messages from a channel you know, run through the pipeline above, and you will have a working baseline before the end of the day. Leave a question in the comments if you hit a snag — the message_fragments / feature-engineering section in particular has sharp edges worth talking through.

Built by Devil Scrapes — we do the dirty work so your dataset stays clean. 😈

DEV Community