IPFoxy

Posted on Jun 23

How to Scrape TikTok Comments with Python: A Step-by-Step Tutorial

#python #programming #tiktok

TikTok is one of the world's largest short-video social platforms, with over 1 billion monthly active users. Its content spans entertainment, education, e-commerce, and many other categories. Within this massive content ecosystem, the comment section is often even more valuable than the videos themselves: real user feedback, sentiment trends, and purchase intent are all reflected in comment data.
However, TikTok does not provide a general-purpose public API for comments, and its pages rely heavily on JavaScript-based dynamic rendering, which makes data collection more challenging. In this tutorial, you'll learn how to use Python to scrape TikTok comment data step by step.

I. Core Use Cases of TikTok Comment Data

Before getting started, let's first answer an important question: What can TikTok comment data be used for?
• Competitor Analysis: By collecting comments from popular competitor videos, you can quickly identify customer pain points and feature requests, helping shape a differentiated content strategy.
• Viral Content Research: Highly liked comments often reveal what resonates with audiences. Analyzing keyword distributions in viral video comments can help content teams identify the next growth opportunity.
• Brand Monitoring and Reputation Management: Continuous sentiment analysis of TikTok comments related to your brand enables real-time monitoring of public opinion and supports faster decision-making.
• User Profiling: Comment content, engagement frequency, IP location, and other information provide valuable raw data for building detailed audience profiles.
Based on these scenarios, the goal of this tutorial is to build a lightweight TikTok data monitoring tool with Python and use proxy technology to overcome common access limitations, making TikTok data collection more stable and efficient.

II. Python TikTok Comment Scraper: Complete Step-by-Step Tutorial

With the use cases in mind, let's move on to implementation. The process consists of eight steps, most of them with code examples you can adapt.

1. Prepare Your Environment and Install Dependencies

Before scraping TikTok comments, install the following Python libraries:

pip install requests pandas jieba wordcloud matplotlib

Library overview:
• requests: Send HTTP requests
• pandas: Data cleaning and CSV storage
• jieba: Chinese word segmentation for keyword analysis
• wordcloud + matplotlib: Word cloud visualization
Recommended project structure:

TikTokCommentScraper/
├── data/           # Scraped data
├── scripts/        # Scraper scripts
├── logs/           # Runtime logs
├── requirements.txt
└── README.md

2. Analyze TikTok's Comment Loading Mechanism

Before writing code, inspect TikTok's comment API using your browser's developer tools (F12).
Open a target video page, switch to the Network tab, and filter XHR requests. Look for requests similar to /api/comment/list/. These endpoints return structured JSON data containing comment text, like counts, user information, IP locations, and more.
Since TikTok comments are dynamically rendered with JavaScript, traditional static page requests cannot retrieve the data directly. Two key parameters are:
• cursor: Controls pagination offset (typically increments by 20)
• aweme_id: The target video ID
These parameters are essential for paginated comment collection.
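
The aweme_id usually appears in the video URL itself. The following is a minimal sketch for pulling it out, assuming the common URL shape https://www.tiktok.com/@<user>/video/<numeric id>. The helper name extract_video_id is hypothetical and not part of any TikTok tooling.

```python
import re
from typing import Optional

# Assumed URL shape: https://www.tiktok.com/@<user>/video/<numeric id>
VIDEO_ID_PATTERN = re.compile(r"/video/(\d+)")


def extract_video_id(url: str) -> Optional[str]:
    """Return the numeric video ID from a TikTok video URL, or None if absent."""
    match = VIDEO_ID_PATTERN.search(url)
    return match.group(1) if match else None
```

The returned string can then be passed as aweme_id when building the comment request. Short links or other URL formats would need to be resolved to the full URL first.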

3. Build Requests and Retrieve Comment Data

After copying the request headers, cookies, and parameters from the Network panel, you can construct requests as follows:

import requests

# Recommended: Use Session
session = requests.Session()

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Referer": "https://www.tiktok.com/",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Cookie": "YOUR_COOKIE"
}

def get_comments(aweme_id, cursor=0, count=20):
    url = "https://www.tiktok.com/api/comment/list/"

    params = {
        "aweme_id": aweme_id,
        "cursor": cursor,
        "count": count,
        # "X-Bogus": "xxxxxx",
        # "_signature": "xxxxxx",
    }

    try:
        response = session.get(url, headers=HEADERS, params=params, timeout=10)
        response.raise_for_status()
        return response.json()
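    # ValueError also covers a response body that is not valid JSON (for example, an empty body)
    except (requests.RequestException, ValueError) as e: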
        print(f"Request failed: {e}")
        return None

Note: Cookies expire periodically and must be refreshed. Storing cookies in a separate configuration file is recommended for easier management.

4. Configure Proxies for More Stable TikTok Data Collection

Frequent requests to the same endpoint can trigger TikTok's anti-bot mechanisms, resulting in IP bans and interrupted scraping tasks. To improve collection stability, many professional data teams use rotating proxy pools from providers such as IPFoxy Proxies.
The following example demonstrates how to configure a rotating residential proxy from IPFoxy Proxies in Python.
• Obtain Proxy Credentials
Create a rotating residential proxy configuration in IPFoxy Proxies by selecting the target location, protocol type, session rotation mode, and output format. Once generated, you'll receive proxy connection details.

• Configure the Proxy in Python
Suppose your proxy credentials are:
username:password@gate-us-ipfoxy.io:58688

Use the following code:

import urllib.request

if __name__ == '__main__':
    proxy = urllib.request.ProxyHandler({
        'https': 'username:password@gate-us-ipfoxy.io:58688',
        'http': 'username:password@gate-us-ipfoxy.io:58688',
    })

    opener = urllib.request.build_opener(
        proxy,
        urllib.request.HTTPHandler
    )

    urllib.request.install_opener(opener)

    content = urllib.request.urlopen(
        'http://www.ip-api.com/json'
    ).read()

    print(content)

If the returned IP address differs from your local IP, the proxy has been configured successfully.
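
The urllib example above installs a global opener, which the requests session from step 3 does not use. To route that session through the same proxy, a sketch along these lines may help. It assumes the gateway accepts HTTP proxy connections (the protocol is whatever you selected when generating the credentials), and build_proxies is a hypothetical helper name.

```python
from urllib.parse import quote


def build_proxies(username, password, host, port):
    """Build a proxies mapping in the format that requests accepts."""
    # Percent-encode credentials so characters such as '@' or ':' do not break the URL.
    auth = f"{quote(username, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{auth}@{host}:{port}"
    # The same proxy endpoint is used for both http and https targets.
    return {"http": proxy_url, "https": proxy_url}


# Placeholder credentials from the example above.
proxies = build_proxies("username", "password", "gate-us-ipfoxy.io", 58688)
```

Calling session.proxies.update(proxies) before get_comments should then send every request made through the session via the proxy.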

In addition to proxy rotation, adding random delays such as time.sleep(random.uniform(1.5, 3.5)) between requests makes the traffic look more like natural user behavior and can reduce the risk of being blocked.

5. Parse TikTok Comment JSON Data

The primary comment fields are located within the comments array of the JSON response.

from datetime import datetime

def parse_comments(json_data):
    if not json_data or not isinstance(json_data, dict):
        return []

    comment_list = json_data.get("comments") or []
    results = []

    for item in comment_list:
        raw_time = item.get("create_time")
        formatted_time = ""

        if raw_time:
            try:
                formatted_time = datetime.fromtimestamp(
                    int(raw_time)
                ).strftime('%Y-%m-%d %H:%M:%S')
            except Exception:
                formatted_time = str(raw_time)

        results.append({
            "comment_id": item.get("cid"),
            "text": item.get("text"),
            "like_count": item.get("digg_count", 0),
            "reply_count": item.get("reply_comment_total", 0),
            "create_time": formatted_time,
            "ip_location": item.get("ip_label", "Unknown"),
            # "user" may be present but null, so fall back to an empty dict
            "user_name": (item.get("user") or {}).get(
                "nickname",
                "Deleted User"
            ),
        })

    return results

Commonly extracted fields include comment ID, comment text, like count, publish time, IP location, username, and reply count.
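
Note that datetime.fromtimestamp without a timezone converts to the local time of the machine running the scraper, so the same comment can get different timestamps on different servers. If consistent times matter, a UTC variant such as the sketch below could be swapped in. The helper name format_timestamp_utc is hypothetical, and it assumes create_time is a Unix timestamp in seconds, as the parsing code above does.

```python
from datetime import datetime, timezone


def format_timestamp_utc(raw_time):
    """Convert a Unix timestamp in seconds to a UTC string; return '' if it cannot be parsed."""
    try:
        ts = int(raw_time)
        return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    except (TypeError, ValueError, OverflowError, OSError):
        # Missing, non-numeric, or out-of-range values fall back to an empty string.
        return ""
```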

6. Save Comment Data to CSV

Use pandas to store parsed data while supporting incremental writes.

import pandas as pd
import os

def save_to_csv(data, filepath="data/comments.csv"):
    if not data:
        print("No comment data received.")
        return

    dir_name = os.path.dirname(filepath)

    # Create the target directory if it does not exist yet
    if dir_name:
        os.makedirs(dir_name, exist_ok=True)

    df = pd.DataFrame(data)

    # Append to an existing file; write the header only when creating it
    file_exists = os.path.exists(filepath)

    df.to_csv(
        filepath,
        mode="a",
        header=not file_exists,
        index=False,
        encoding="utf_8_sig"
    )

    print(
        f"Successfully saved {len(data)} comments."
    )

The utf_8_sig encoding prevents character corruption when opening CSV files in Excel.

7. Automate Pagination to Collect More TikTok Comments

A single request usually returns 20 comments. Incrementing the cursor value enables large-scale collection.

import time
import random

def scrape_all_comments(
    aweme_id,
    max_pages=50
):
    all_comments = []
    cursor = 0

    filepath = (
        f"data/comments_{aweme_id}.csv"
    )

    print(
        f"Scraping comments for video {aweme_id}"
    )

    for page in range(max_pages):

        data = get_comments(
            aweme_id,
            cursor=cursor
        )

        if not data or not isinstance(data, dict):
            break

        comments = parse_comments(data)

        if not comments:
            break

        all_comments.extend(comments)

        save_to_csv(
            comments,
            filepath=filepath
        )

        has_more = data.get("has_more")

        if not has_more:
            break

        next_cursor = data.get("cursor")

        if next_cursor is not None:
            cursor = next_cursor
        else:
            cursor += 20

        time.sleep(
            random.uniform(2.0, 4.5)
        )

    print(
        f"Finished. Total comments: {len(all_comments)}"
    )

    return all_comments

The has_more field determines whether additional comments are available, preventing unnecessary requests.

8. Analyze Keywords and Generate a Word Cloud

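After collecting comments, you can perform keyword analysis and visualization. Pass the CSV written in the previous step (data/comments_<aweme_id>.csv) as csv_path, since the default path in the function is only a placeholder.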

import pandas as pd
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter

def analyze_keywords(
    csv_path="data/comments.csv",
    top_n=20
):
    try:
        df = pd.read_csv(
            csv_path,
            encoding="utf_8_sig"
        )
    except FileNotFoundError:
        print(f"CSV file not found: {csv_path}")
        return []

    comments_text = " ".join(
        df["text"]
        .dropna()
        .astype(str)
        .tolist()
    )

    stopwords = {
        "的", "了", "在", "是", "我",
        "你", "他", "她", "它"
    }

    cleaned_words = [
        w for w in jieba.cut(comments_text)
        if len(w) > 1 and w not in stopwords
    ]

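    freq = Counter(
        cleaned_words
    ).most_common(top_n)

    # WordCloud.generate raises ValueError on empty input, so stop early
    if not cleaned_words:
        print("No keywords left after filtering.")
        return freq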

    wordcloud_input_text = " ".join(
        cleaned_words
    )

    wc = WordCloud(
        # A CJK-capable font file is needed to render Chinese text; adjust the path for your system
        font_path="simhei.ttf",
        width=800,
        height=400,
        background_color="white",
        max_words=100
    ).generate(wordcloud_input_text)

    plt.figure(figsize=(10, 5))
    plt.imshow(
        wc,
        interpolation="bilinear"
    )
    plt.axis("off")
    plt.show()

    # Return the top keywords so callers can inspect or store them
    return freq

Word clouds provide a quick visual overview of high-frequency keywords and can help identify user interests, content opportunities, and operational insights.
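
The jieba-based function targets Chinese text. For comments written mostly in English, a plain regular-expression tokenizer may be enough. The sketch below is one possible approach, not part of the original workflow; the stopword list is deliberately tiny and illustrative, and top_keywords is a hypothetical helper name.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "and", "this", "that", "for", "with", "you", "are", "was"}


def top_keywords(comments, top_n=20):
    """Count the most frequent words across an iterable of comment strings."""
    words = []
    for text in comments:
        # Lowercase the text, then keep runs of letters/apostrophes of length 3 or more.
        for word in re.findall(r"[a-z']{3,}", str(text).lower()):
            if word not in STOPWORDS:
                words.append(word)
    return Counter(words).most_common(top_n)
```

The resulting list of (word, count) pairs can be printed directly or joined into a string and passed to WordCloud.generate, in which case the font_path argument is typically unnecessary for Latin text.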

III. Strategies to Improve TikTok Scraping Efficiency and Stability

After the basic workflow is running, consider the following optimizations for production environments:

1. Request Rate Control

Beyond random delays, implement algorithms such as token buckets to precisely control QPS and avoid bursts of requests.
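
A token bucket can be sketched in a few lines. The version below is a minimal single-threaded illustration (it is not thread-safe), and the class name and parameters are assumptions rather than an existing library API. Tokens accumulate at rate per second up to capacity, and each request consumes one.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts of up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last_refill = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def try_acquire(self):
        """Consume one token if available; return True on success."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def acquire(self):
        """Block until a token is available, then consume it."""
        while not self.try_acquire():
            # Sleep roughly as long as it takes to accumulate the missing fraction.
            time.sleep((1 - self.tokens) / self.rate)
```

For example, bucket = TokenBucket(rate=0.5, capacity=2) followed by bucket.acquire() before each request would hold the long-run rate to about one request every two seconds while still allowing a short burst of two.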

2. Automatic Cookie Refresh

Expired cookies are one of the most common causes of scraping interruptions. Monitor cookie validity and trigger alerts or refresh workflows before expiration.

3. Retry Mechanisms

Implement exponential backoff retry logic. For example, retry failed requests up to three times, doubling the wait time after each failure.
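
A minimal sketch of that retry policy follows. It assumes the wrapped function raises an exception on failure; get_comments from step 3 returns None instead, so it would need a small wrapper that raises. The name retry_with_backoff and its parameters are hypothetical.

```python
import time


def retry_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry, doubling the delay each time."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            # After the last allowed retry, let the exception propagate.
            if attempt == max_retries:
                raise
            sleep(delay)
            delay *= 2
```

With the defaults, a call that keeps failing waits 1, 2, and 4 seconds between attempts before giving up. The sleep parameter exists mainly so the waits can be replaced in tests.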

4. Data Deduplication

Duplicate comments may occasionally appear during pagination. Use comment_id as a unique key before writing records to CSV.
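
One simple way to do this is to keep a set of IDs already written and filter each page against it. The sketch below assumes records shaped like the output of parse_comments (dicts with a comment_id key); dedupe_comments is a hypothetical helper name.

```python
def dedupe_comments(comments, seen_ids):
    """Return comments whose comment_id is not in seen_ids, adding new IDs to seen_ids."""
    unique = []
    for comment in comments:
        cid = comment.get("comment_id")
        # Skip records without an ID as well as IDs that were already seen.
        if cid is None or cid in seen_ids:
            continue
        seen_ids.add(cid)
        unique.append(comment)
    return unique
```

In the pagination loop, a seen_ids = set() created before the loop and a call such as comments = dedupe_comments(comments, seen_ids) before save_to_csv would keep repeated records out of the file.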

5. Logging

Use Python's logging module to record status codes, collected record counts, and exceptions for easier troubleshooting and progress tracking.

IV. FAQ

Q: Can I scrape TikTok comments without a TikTok account?
A: Some endpoints may be accessible without logging in, but success rates and stability are generally lower. Using a valid account cookie is recommended.
Q: Can I scrape comments from multiple videos simultaneously?
A: Yes. You can use Python's concurrent.futures module for concurrent collection. Be sure to manage overall request rates carefully to avoid triggering anti-bot systems.
Q: How can I collect replies (nested comments)?
A: Pass the corresponding comment_id parameter to the reply endpoint. The pagination logic is essentially the same as for top-level comments and can be reused directly.
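
For the multi-video question above, concurrent collection with concurrent.futures could be sketched as follows. The function name scrape_many is hypothetical, and scrape_func stands for any callable that takes a video ID and returns a list of comments, such as scrape_all_comments from step 7.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_many(video_ids, scrape_func, max_workers=3):
    """Run scrape_func(video_id) for each ID in a small thread pool and collect results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the video ID it belongs to.
        futures = {executor.submit(scrape_func, vid): vid for vid in video_ids}
        for future in as_completed(futures):
            vid = futures[future]
            try:
                results[vid] = future.result()
            except Exception as exc:
                # Record the failure and keep going with the other videos.
                print(f"Video {vid} failed: {exc}")
                results[vid] = []
    return results
```

Because scrape_all_comments writes each video to its own CSV file, the worker threads do not compete for the same output file. Keeping max_workers small is one way to hold the overall request rate down.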

V. Conclusion

This tutorial demonstrates the eight steps required to scrape TikTok comments with Python: dependency installation, API analysis, request construction, proxy configuration, data parsing, CSV storage, pagination, and keyword analysis.
For real-world projects, stability is just as important as functionality. Implementing rate control, cookie management, retry mechanisms, and proxy rotation can significantly improve collection reliability.
Ultimately, the value of TikTok comment data lies in how it is used. Combining comment keywords with sentiment analysis, audience profiling, and content strategy enables teams to transform raw data into actionable business insights.

DEV Community