How to Scrape Reddit: Posts, Comments, and Subreddit Data in 2026

Reddit is one of the richest sources of user-generated content on the web. With the API changes of 2023-2024, scraping Reddit requires updated approaches. Here is how to collect Reddit data effectively in 2026.

The Landscape in 2026

Reddit's official API now has strict rate limits and pricing for commercial use. However, for research and personal projects, there are still viable approaches:

  • Official API (free tier): 100 requests/minute, good for small projects
  • Old Reddit HTML: Still accessible, lighter pages
  • Public JSON endpoints: Append .json to most Reddit URLs (quick example below)
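
As a quick illustration of the JSON endpoints, here is a minimal sketch using requests; the about.json endpoint returns subreddit metadata such as the subscriber count:

import requests

# Appending ".json" to a public Reddit URL returns the page data as JSON
url = "https://old.reddit.com/r/python/about.json"
headers = {"User-Agent": "DataCollector/1.0"}
data = requests.get(url, headers=headers).json()
print(data["data"]["subscribers"])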

Setup

pip install requests pandas praw

Method 1: Reddit JSON Endpoints

The simplest approach — no API key needed:

import requests
import pandas as pd
import time

def scrape_subreddit(subreddit, limit=100):
    posts = []
    after = None  # pagination cursor returned by Reddit in each response
    # A descriptive User-Agent helps avoid being blocked or rate limited
    headers = {"User-Agent": "DataCollector/1.0"}

    while len(posts) < limit:
        url = f"https://old.reddit.com/r/{subreddit}/hot.json"
        params = {"limit": 25, "after": after}
        response = requests.get(url, headers=headers, params=params)

        if response.status_code != 200:
            break

        data = response.json()
        children = data["data"]["children"]

        if not children:
            break

        for child in children:
            post = child["data"]
            posts.append({
                "title": post["title"],
                "author": post.get("author", "[deleted]"),
                "score": post["score"],
                "num_comments": post["num_comments"],
                "created_utc": post["created_utc"],
                "url": post["url"],
                "selftext": post.get("selftext", "")[:500],
                "permalink": f"https://reddit.com{post['permalink']}"
            })

        after = data["data"].get("after")  # cursor for the next page
        if not after:
            break
        time.sleep(2)  # polite delay between pages

    return pd.DataFrame(posts[:limit])

df = scrape_subreddit("python", limit=200)
print(f"Collected {len(df)} posts")
print(df[["title", "score", "num_comments"]].head(10))

Method 2: Using PRAW (Official API)

For more reliable access, use Reddit's official library:

import praw
import pandas as pd

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_SECRET",
    user_agent="DataCollector/1.0"
)

def get_subreddit_data(subreddit_name, sort="hot", limit=100):
    subreddit = reddit.subreddit(subreddit_name)
    posts = []

    sort_method = getattr(subreddit, sort)
    for post in sort_method(limit=limit):
        posts.append({
            "title": post.title,
            "author": str(post.author),
            "score": post.score,
            "upvote_ratio": post.upvote_ratio,
            "num_comments": post.num_comments,
            "created_utc": post.created_utc,
            "url": post.url,
            "is_self": post.is_self,
            "flair": post.link_flair_text
        })

    return pd.DataFrame(posts)

df = get_subreddit_data("machinelearning", sort="top", limit=500)
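
With only a client ID and secret (no username or password), PRAW runs in read-only mode, which is enough for collecting public posts. A quick sanity check, assuming the reddit instance from the block above:

# Prints True when PRAW is in read-only mode (no username/password supplied)
print(reddit.read_only)

# Fetch one hot post to confirm the credentials work
print(next(reddit.subreddit("python").hot(limit=1)).title)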

Scraping Comments

Comments contain the real value — opinions, recommendations, and discussions:

def get_post_comments(post_url, depth=3):
    url = post_url.rstrip("/") + ".json"
    headers = {"User-Agent": "DataCollector/1.0"}
    response = requests.get(url, headers=headers)
    data = response.json()

    comments = []

    def parse_comments(comment_data, level=0):
        if level >= depth:
            return
        if isinstance(comment_data, dict):
            body = comment_data.get("body", "")
            if body:
                comments.append({
                    "author": comment_data.get("author", "[deleted]"),
                    "body": body,
                    "score": comment_data.get("score", 0),
                    "level": level
                })
            replies = comment_data.get("replies", "")
            if isinstance(replies, dict):
                children = replies.get("data", {}).get("children", [])
                for child in children:
                    parse_comments(child.get("data", {}), level + 1)

    # Element 0 of the response is the post itself; element 1 is the comment tree
    listing = data[1]["data"]["children"]
    for item in listing:
        parse_comments(item.get("data", {}))

    return comments
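
A minimal usage sketch; the permalink below is a placeholder, so substitute one collected by scrape_subreddit() above:

# Placeholder URL; swap in a real permalink from the posts collected earlier
post_url = "https://old.reddit.com/r/python/comments/POST_ID/example_post"

comments = get_post_comments(post_url, depth=2)
comments_df = pd.DataFrame(comments)
print(f"Collected {len(comments_df)} comments")
print(comments_df.sort_values("score", ascending=False).head(5))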

Handling Anti-Scraping Measures

Reddit actively blocks scrapers. For production-scale collection, use ScraperAPI to handle IP rotation and rate limiting:

def reddit_via_proxy(subreddit):
    url = f"https://old.reddit.com/r/{subreddit}/hot.json"
    params = {"api_key": "YOUR_SCRAPERAPI_KEY", "url": url}
    return requests.get("https://api.scraperapi.com", params=params).json()

ThorData provides datacenter and residential proxies that hold up well against Reddit's detection systems. For monitoring your scraper's success rate, ScrapeOps offers real-time dashboards.
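
If you route traffic through a proxy provider instead, the standard requests proxies parameter is enough; the endpoint and credentials below are placeholders for whatever your provider issues:

# Placeholder proxy endpoint and credentials; use the values from your provider
proxies = {
    "http": "http://USERNAME:PASSWORD@proxy.example.com:8000",
    "https": "http://USERNAME:PASSWORD@proxy.example.com:8000",
}

response = requests.get(
    "https://old.reddit.com/r/python/hot.json",
    headers={"User-Agent": "DataCollector/1.0"},
    proxies=proxies,
    timeout=30,
)
print(response.status_code)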

Analyzing Reddit Data

# Engagement score: weight comments more heavily than raw upvotes
df["engagement"] = df["score"] + df["num_comments"] * 2
top_posts = df.nlargest(10, "engagement")
print("Top engaged posts:")
for _, row in top_posts.iterrows():
    print(f"  [{row['score']}pts {row['num_comments']}comments] {row['title'][:80]}")

df["hour"] = pd.to_datetime(df["created_utc"], unit="s").dt.hour
best_hours = df.groupby("hour")["score"].mean().nlargest(5)
print(f"\nBest posting hours: {list(best_hours.index)}")

Ethical Guidelines

  • Respect Reddit's robots.txt and rate limits
  • Do not scrape private or quarantined subreddits
  • Anonymize user data if publishing research
  • Add delays between requests (minimum 2 seconds); see the backoff sketch after this list
  • Consider using the official API for commercial projects
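
One way to honor these limits is a small wrapper that pauses between requests and backs off when Reddit answers with HTTP 429; a minimal sketch, reusing the headers from the earlier examples:

import time
import requests

def polite_get(url, headers, params=None, max_retries=3, base_delay=2):
    """GET with a fixed pause between calls and exponential backoff on 429."""
    response = None
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, params=params, timeout=30)
        if response.status_code == 429:
            # Too many requests: wait longer before each retry
            time.sleep(base_delay * (2 ** attempt))
            continue
        time.sleep(base_delay)  # baseline delay between successful requests
        return response
    return response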

Follow for more Python web scraping guides updated for 2026!
