How to Scrape Reddit: Posts, Comments, and Subreddit Data in 2026
Reddit is one of the richest sources of user-generated content on the web. With the API changes of 2023-2024, scraping Reddit requires updated approaches. Here is how to collect Reddit data effectively in 2026.
The Landscape in 2026
Reddit's official API now has strict rate limits and pricing for commercial use. However, for research and personal projects, there are still viable approaches:
- Official API (free tier): 100 requests/minute, good for small projects
- Old Reddit HTML: still accessible, lighter pages
- Public JSON endpoints: append `.json` to most Reddit URLs
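The `.json` trick from the list above is easy to wrap in a tiny helper. A minimal sketch (the `to_json_url` name is my own, not part of any library):

```python
def to_json_url(reddit_url: str) -> str:
    """Turn a Reddit page URL into its public JSON equivalent
    by appending .json (illustrative helper, not an official API)."""
    return reddit_url.rstrip("/") + ".json"

print(to_json_url("https://old.reddit.com/r/python/"))
# https://old.reddit.com/r/python.json
```

The same transformation works for post permalinks, which is exactly what the comment scraper later in this guide relies on.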
Setup
```bash
pip install requests pandas praw
```
Method 1: Reddit JSON Endpoints
The simplest approach — no API key needed:
```python
import requests
import pandas as pd
import time

def scrape_subreddit(subreddit, limit=100):
    posts = []
    after = None
    headers = {"User-Agent": "DataCollector/1.0"}
    while len(posts) < limit:
        url = f"https://old.reddit.com/r/{subreddit}/hot.json"
        params = {"limit": 25, "after": after}
        response = requests.get(url, headers=headers, params=params)
        if response.status_code != 200:
            break
        data = response.json()
        children = data["data"]["children"]
        if not children:
            break
        for child in children:
            post = child["data"]
            posts.append({
                "title": post["title"],
                "author": post.get("author", "[deleted]"),
                "score": post["score"],
                "num_comments": post["num_comments"],
                "created_utc": post["created_utc"],
                "url": post["url"],
                "selftext": post.get("selftext", "")[:500],
                "permalink": f"https://reddit.com{post['permalink']}"
            })
        # "after" is Reddit's pagination cursor; None means the last page
        after = data["data"].get("after")
        if not after:
            break
        time.sleep(2)  # stay polite and under the rate limit
    return pd.DataFrame(posts[:limit])

df = scrape_subreddit("python", limit=200)
print(f"Collected {len(df)} posts")
print(df[["title", "score", "num_comments"]].head(10))
```
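The loop above simply gives up on any non-200 response. When Reddit rate-limits you it returns HTTP 429, often with a `Retry-After` header, so a more forgiving variant waits and retries before bailing. A hedged sketch (the `backoff_delay` and `get_with_retries` names are my own):

```python
import time
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter: ~2s, 4s, 8s, ... capped at 60s.
    (Illustrative helper, not part of requests or Reddit's API.)"""
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, 1)

def get_with_retries(session, url, max_attempts=4, **kwargs):
    """Retry GETs that hit the rate limiter (HTTP 429)."""
    for attempt in range(max_attempts):
        response = session.get(url, **kwargs)
        if response.status_code != 429:
            return response
        # Honor Retry-After when the server sends it, else back off exponentially
        wait = float(response.headers.get("Retry-After", backoff_delay(attempt)))
        time.sleep(wait)
    return response
```

Drop this in place of the bare `requests.get` call in `scrape_subreddit` (passing a `requests.Session()`) and transient 429s stop killing a long collection run.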
Method 2: Using PRAW (Official API)
For more reliable access, use Reddit's official library:
```python
import praw
import pandas as pd

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_SECRET",
    user_agent="DataCollector/1.0"
)

def get_subreddit_data(subreddit_name, sort="hot", limit=100):
    subreddit = reddit.subreddit(subreddit_name)
    posts = []
    sort_method = getattr(subreddit, sort)  # hot, new, top, rising
    for post in sort_method(limit=limit):
        posts.append({
            "title": post.title,
            "author": str(post.author),
            "score": post.score,
            "upvote_ratio": post.upvote_ratio,
            "num_comments": post.num_comments,
            "created_utc": post.created_utc,
            "url": post.url,
            "is_self": post.is_self,
            "flair": post.link_flair_text
        })
    return pd.DataFrame(posts)

df = get_subreddit_data("machinelearning", sort="top", limit=500)
```
Scraping Comments
Comments contain the real value — opinions, recommendations, and discussions:
```python
def get_post_comments(post_url, depth=3):
    url = post_url.rstrip("/") + ".json"
    headers = {"User-Agent": "DataCollector/1.0"}
    response = requests.get(url, headers=headers)
    data = response.json()
    comments = []

    def parse_comments(comment_data, level=0):
        if level >= depth:
            return
        if isinstance(comment_data, dict):
            body = comment_data.get("body", "")
            if body:
                comments.append({
                    "author": comment_data.get("author", "[deleted]"),
                    "body": body,
                    "score": comment_data.get("score", 0),
                    "level": level
                })
            # "replies" is an empty string when a comment has no children
            replies = comment_data.get("replies", "")
            if isinstance(replies, dict):
                children = replies.get("data", {}).get("children", [])
                for child in children:
                    parse_comments(child.get("data", {}), level + 1)

    # data[0] is the post itself; data[1] holds the comment listing
    listing = data[1]["data"]["children"]
    for item in listing:
        parse_comments(item.get("data", {}))
    return comments
```
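You can sanity-check the parsing logic without touching the network by feeding a hand-built fragment of Reddit's JSON shape to a standalone variant of the inner parser. The function name and the sample data below are invented for illustration:

```python
def flatten_comment_tree(children, depth=3, level=0):
    """Walk a list of comment dicts in Reddit's JSON shape and return flat rows.
    (A standalone twin of the nested parse_comments above, for offline testing.)"""
    rows = []
    if level >= depth:
        return rows
    for comment in children:
        body = comment.get("body", "")
        if body:
            rows.append({"author": comment.get("author", "[deleted]"),
                         "body": body,
                         "score": comment.get("score", 0),
                         "level": level})
        replies = comment.get("replies", "")  # empty string when no children
        if isinstance(replies, dict):
            nested = [c.get("data", {})
                      for c in replies.get("data", {}).get("children", [])]
            rows.extend(flatten_comment_tree(nested, depth, level + 1))
    return rows

# Invented sample mimicking Reddit's nested structure
sample = [{
    "author": "alice", "body": "Great post", "score": 12,
    "replies": {"data": {"children": [
        {"data": {"author": "bob", "body": "Agreed", "score": 3, "replies": ""}}
    ]}}
}]
print(flatten_comment_tree(sample))
```

If the output shows alice at level 0 and bob at level 1, the traversal and depth bookkeeping are behaving as expected.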
Handling Anti-Scraping Measures
Reddit actively blocks scrapers. For production-scale collection, a proxy service such as ScraperAPI can handle IP rotation and rate limiting for you:
```python
def reddit_via_proxy(subreddit):
    url = f"https://old.reddit.com/r/{subreddit}/hot.json"
    params = {"api_key": "YOUR_SCRAPERAPI_KEY", "url": url}
    return requests.get("https://api.scraperapi.com", params=params).json()
```
ThorData provides datacenter and residential proxies that hold up well against Reddit's detection systems. For monitoring your scraper's success rate, ScrapeOps offers real-time dashboards.
Analyzing Reddit Data
```python
df["engagement"] = df["score"] + df["num_comments"] * 2
top_posts = df.nlargest(10, "engagement")

print("Top engaged posts:")
for _, row in top_posts.iterrows():
    print(f"  [{row['score']} pts, {row['num_comments']} comments] {row['title'][:80]}")

df["hour"] = pd.to_datetime(df["created_utc"], unit="s").dt.hour
best_hours = df.groupby("hour")["score"].mean().nlargest(5)
print(f"\nBest posting hours (UTC): {list(best_hours.index)}")
```
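Beyond engagement scores, the post titles themselves are worth mining. A simple word-frequency sketch using only the standard library (the `top_keywords` helper and the sample titles are invented; a real analysis would use a proper stopword list):

```python
from collections import Counter
import re

def top_keywords(titles, n=5, min_len=4):
    """Count the most common words across post titles.
    min_len crudely filters out short stopwords like 'the' and 'for'."""
    words = []
    for title in titles:
        words.extend(w for w in re.findall(r"[a-z']+", title.lower())
                     if len(w) >= min_len)
    return Counter(words).most_common(n)

# Invented sample titles for illustration; in practice use df["title"]
titles = ["Python tips for beginners", "Why Python beats everything",
          "Beginners guide to asyncio"]
print(top_keywords(titles, n=3))
```

Running it on `df["title"]` from either collection method surfaces the recurring topics in a subreddit at a glance.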
Ethical Guidelines
- Respect Reddit's robots.txt and rate limits
- Do not scrape private or quarantined subreddits
- Anonymize user data if publishing research
- Add delays between requests (minimum 2 seconds)
- Consider using the official API for commercial projects
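The minimum-delay guideline above can be enforced in one place instead of sprinkling `time.sleep` calls around every request. A minimal sketch (the `Throttle` class is my own, not a feature of requests; the clock and sleep functions are injectable purely so the logic is testable without real waiting):

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()

# Usage: create one Throttle and call throttle.wait() before every requests.get(...)
```

Centralizing the delay also makes it trivial to slow the whole scraper down later by changing a single number.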
Follow for more Python web scraping guides updated for 2026!