How to Scrape Reddit: Posts, Comments, and Subreddit Data in 2026

Reddit is one of the richest sources of user-generated content on the web. With the API changes of 2023-2024, scraping Reddit requires updated approaches. Here is how to collect Reddit data effectively in 2026.

The Landscape in 2026

Reddit's official API now has strict rate limits and pricing for commercial use. However, for research and personal projects, there are still viable approaches:

  • Official API (free tier): 100 requests/minute, good for small projects
  • Old Reddit HTML: Still accessible, lighter pages
  • Public JSON endpoints: Append .json to most Reddit URLs (quick example below)
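
As a quick illustration of the JSON endpoints, here is a minimal sketch using requests; the about.json endpoint returns subreddit metadata such as the subscriber count:

import requests

# Appending ".json" to a public Reddit URL returns the page data as JSON
url = "https://old.reddit.com/r/python/about.json"
headers = {"User-Agent": "DataCollector/1.0"}
data = requests.get(url, headers=headers).json()
print(data["data"]["subscribers"])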

Setup

pip install requests pandas praw

Method 1: Reddit JSON Endpoints

The simplest approach — no API key needed:

import requests
import pandas as pd
import time

def scrape_subreddit(subreddit, limit=100):
    posts = []
    after = None  # pagination cursor returned by Reddit in each response
    # A descriptive User-Agent helps avoid being blocked or rate limited
    headers = {"User-Agent": "DataCollector/1.0"}

    while len(posts) < limit:
        url = f"https://old.reddit.com/r/{subreddit}/hot.json"
        params = {"limit": 25, "after": after}
        response = requests.get(url, headers=headers, params=params)

        if response.status_code != 200:
            break

        data = response.json()
        children = data["data"]["children"]

        if not children:
            break

        for child in children:
            post = child["data"]
            posts.append({
                "title": post["title"],
                "author": post.get("author", "[deleted]"),
                "score": post["score"],
                "num_comments": post["num_comments"],
                "created_utc": post["created_utc"],
                "url": post["url"],
                "selftext": post.get("selftext", "")[:500],
                "permalink": f"https://reddit.com{post['permalink']}"
            })

        after = data["data"].get("after")  # cursor for the next page
        if not after:
            break
        time.sleep(2)  # polite delay between pages

    return pd.DataFrame(posts[:limit])

df = scrape_subreddit("python", limit=200)
print(f"Collected {len(df)} posts")
print(df[["title", "score", "num_comments"]].head(10))

Method 2: Using PRAW (Official API)

For more reliable access, use Reddit's official library:

import praw
import pandas as pd

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_SECRET",
    user_agent="DataCollector/1.0"
)

def get_subreddit_data(subreddit_name, sort="hot", limit=100):
    subreddit = reddit.subreddit(subreddit_name)
    posts = []

    sort_method = getattr(subreddit, sort)
    for post in sort_method(limit=limit):
        posts.append({
            "title": post.title,
            "author": str(post.author),
            "score": post.score,
            "upvote_ratio": post.upvote_ratio,
            "num_comments": post.num_comments,
            "created_utc": post.created_utc,
            "url": post.url,
            "is_self": post.is_self,
            "flair": post.link_flair_text
        })

    return pd.DataFrame(posts)

df = get_subreddit_data("machinelearning", sort="top", limit=500)
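
With only a client ID and secret (no username or password), PRAW runs in read-only mode, which is enough for collecting public posts. A quick sanity check, assuming the reddit instance from the block above:

# Prints True when PRAW is in read-only mode (no username/password supplied)
print(reddit.read_only)

# Fetch one hot post to confirm the credentials work
print(next(reddit.subreddit("python").hot(limit=1)).title)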

Scraping Comments

Comments contain the real value — opinions, recommendations, and discussions:

def get_post_comments(post_url, depth=3):
    url = post_url.rstrip("/") + ".json"
    headers = {"User-Agent": "DataCollector/1.0"}
    response = requests.get(url, headers=headers)
    data = response.json()

    comments = []

    def parse_comments(comment_data, level=0):
        if level >= depth:
            return
        if isinstance(comment_data, dict):
            body = comment_data.get("body", "")
            if body:
                comments.append({
                    "author": comment_data.get("author", "[deleted]"),
                    "body": body,
                    "score": comment_data.get("score", 0),
                    "level": level
                })
            replies = comment_data.get("replies", "")
            if isinstance(replies, dict):
                children = replies.get("data", {}).get("children", [])
                for child in children:
                    parse_comments(child.get("data", {}), level + 1)

    # Element 0 of the response is the post itself; element 1 is the comment tree
    listing = data[1]["data"]["children"]
    for item in listing:
        parse_comments(item.get("data", {}))

    return comments
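
A minimal usage sketch; the permalink below is a placeholder, so substitute one collected by scrape_subreddit() above:

# Placeholder URL; swap in a real permalink from the posts collected earlier
post_url = "https://old.reddit.com/r/python/comments/POST_ID/example_post"

comments = get_post_comments(post_url, depth=2)
comments_df = pd.DataFrame(comments)
print(f"Collected {len(comments_df)} comments")
print(comments_df.sort_values("score", ascending=False).head(5))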

Handling Anti-Scraping Measures

Reddit actively blocks scrapers. For production-scale collection, use ScraperAPI to handle IP rotation and rate limiting:

def reddit_via_proxy(subreddit):
    url = f"https://old.reddit.com/r/{subreddit}/hot.json"
    params = {"api_key": "YOUR_SCRAPERAPI_KEY", "url": url}
    return requests.get("https://api.scraperapi.com", params=params).json()

ThorData provides datacenter and residential proxies that hold up well against Reddit's detection systems. For monitoring your scraper's success rate, ScrapeOps offers real-time dashboards.
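
If you route traffic through a proxy provider instead, the standard requests proxies parameter is enough; the endpoint and credentials below are placeholders for whatever your provider issues:

# Placeholder proxy endpoint and credentials; use the values from your provider
proxies = {
    "http": "http://USERNAME:PASSWORD@proxy.example.com:8000",
    "https": "http://USERNAME:PASSWORD@proxy.example.com:8000",
}

response = requests.get(
    "https://old.reddit.com/r/python/hot.json",
    headers={"User-Agent": "DataCollector/1.0"},
    proxies=proxies,
    timeout=30,
)
print(response.status_code)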

Analyzing Reddit Data

# Engagement score: weight comments more heavily than raw upvotes
df["engagement"] = df["score"] + df["num_comments"] * 2
top_posts = df.nlargest(10, "engagement")
print("Top engaged posts:")
for _, row in top_posts.iterrows():
    print(f"  [{row['score']}pts {row['num_comments']}comments] {row['title'][:80]}")

df["hour"] = pd.to_datetime(df["created_utc"], unit="s").dt.hour
best_hours = df.groupby("hour")["score"].mean().nlargest(5)
print(f"\nBest posting hours: {list(best_hours.index)}")

Ethical Guidelines

  • Respect Reddit's robots.txt and rate limits
  • Do not scrape private or quarantined subreddits
  • Anonymize user data if publishing research
  • Add delays between requests (minimum 2 seconds); see the backoff sketch after this list
  • Consider using the official API for commercial projects
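
One way to honor these limits is a small wrapper that pauses between requests and backs off when Reddit answers with HTTP 429; a minimal sketch, reusing the headers from the earlier examples:

import time
import requests

def polite_get(url, headers, params=None, max_retries=3, base_delay=2):
    """GET with a fixed pause between calls and exponential backoff on 429."""
    response = None
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, params=params, timeout=30)
        if response.status_code == 429:
            # Too many requests: wait longer before each retry
            time.sleep(base_delay * (2 ** attempt))
            continue
        time.sleep(base_delay)  # baseline delay between successful requests
        return response
    return response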

Follow for more Python web scraping guides updated for 2026!
