Reddit is a goldmine of user-generated data — product feedback, market research, trend signals, community sentiment. But scraping it in 2026 is a very different challenge than it was three years ago. After Reddit's controversial API pricing changes in 2023 that killed off third-party apps, the platform tightened its defenses significantly.
This guide covers what actually works in 2026: the technical realities, the right tools for each use case, and production-ready Python code you can run today.
## What Changed After the API Pricing War
In June 2023, Reddit introduced tiered API pricing that effectively priced out indie developers. The free tier allows 100 requests per minute for non-commercial use. For anything beyond that, you're looking at $0.24 per 1,000 API calls.
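At the $0.24-per-1,000 rate, it's worth doing the arithmetic before committing to the official API. A quick sketch — the polling cadence below is a made-up example, not a recommendation:

```python
def monthly_api_cost(calls_per_day: int, rate_per_1k: float = 0.24) -> float:
    """Estimate monthly API spend from the per-1,000-call rate."""
    return calls_per_day * 30 * rate_per_1k / 1000

# e.g. polling 100 subreddits once a minute, one call each:
calls_per_day = 100 * 60 * 24  # 144,000 calls/day
print(round(monthly_api_cost(calls_per_day), 2))  # prints 1036.8
```

Roughly a thousand dollars a month for a fairly modest monitor, which explains why so many projects moved off the paid tiers.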
What this meant practically:
- The official API became expensive for large-scale work
- Reddit doubled down on bot detection for scrapers trying to bypass costs
- Rate limiting became more aggressive and context-aware
- User-Agent sniffing and behavioral analysis improved
The good news: Reddit's HTML structure is still accessible, and legitimate scraping at moderate scale remains achievable.
## Approach 1: The Official Reddit API (PRAW)
For many use cases, the official API is still the right answer. If you're doing keyword monitoring, subreddit analysis, or building something that needs to stay within Reddit's ToS, PRAW (Python Reddit API Wrapper) is your friend.
```python
import praw
import time
from datetime import datetime, timezone

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="keyword-monitor/1.0 by u/your_username",
    username="your_reddit_username",
    password="your_reddit_password",
)

def monitor_keyword(subreddit_name: str, keyword: str, limit: int = 100):
    """Monitor a subreddit for posts containing a keyword."""
    subreddit = reddit.subreddit(subreddit_name)
    results = []
    for post in subreddit.new(limit=limit):
        if keyword.lower() in post.title.lower() or keyword.lower() in post.selftext.lower():
            results.append({
                "id": post.id,
                "title": post.title,
                "score": post.score,
                "url": post.url,
                # datetime.utcfromtimestamp is deprecated since Python 3.12
                "created_utc": datetime.fromtimestamp(post.created_utc, tz=timezone.utc).isoformat(),
                "num_comments": post.num_comments,
                "author": str(post.author),
                "subreddit": str(post.subreddit),
            })
    return results

def scrape_comments(post_id: str, more_limit: int = 5):
    """Scrape comments from a Reddit post."""
    submission = reddit.submission(id=post_id)
    submission.comments.replace_more(limit=more_limit)  # Expand "load more" chains
    comments = []
    for comment in submission.comments.list():
        comments.append({
            "id": comment.id,
            "body": comment.body,
            "score": comment.score,
            "author": str(comment.author),
            "created_utc": datetime.fromtimestamp(comment.created_utc, tz=timezone.utc).isoformat(),
            "parent_id": comment.parent_id,
            "depth": comment.depth,
        })
    return comments

# Usage
posts = monitor_keyword("Python", "scraping", limit=50)
print(f"Found {len(posts)} relevant posts")
for post in posts[:3]:
    comments = scrape_comments(post["id"])
    print(f"Post: {post['title'][:60]}... ({len(comments)} comments)")
    time.sleep(0.5)  # Respect rate limits
```
Rate limits with PRAW: the free tier allows 100 requests per minute per OAuth client (only 10 per minute without OAuth). PRAW handles this automatically by reading Reddit's rate-limit response headers and sleeping when you're close to the limit.
When to use PRAW:
- Monitoring specific subreddits for keywords
- Building alert systems for brand mentions
- Academic research within Reddit's data access terms
- Applications that need comment trees or user history
## Approach 2: The JSON Trick (No Authentication Required)

Append `.json` to almost any Reddit URL and it returns the page's data as raw JSON. This is a legitimate, documented feature that Reddit has kept since its early days.
```python
import httpx
import asyncio
import json
from typing import AsyncIterator

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0; +https://yoursite.com/bot)",
}

async def fetch_subreddit_posts(
    subreddit: str,
    sort: str = "new",
    limit: int = 100,
    after: str | None = None,
) -> dict:
    """Fetch posts from a subreddit using the JSON API."""
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"
    params = {"limit": min(limit, 100), "raw_json": 1}
    if after:
        params["after"] = after
    # httpx does not follow redirects by default, and Reddit sometimes issues them
    async with httpx.AsyncClient(follow_redirects=True) as client:
        response = await client.get(url, headers=HEADERS, params=params)
        response.raise_for_status()
        return response.json()

async def paginate_subreddit(
    subreddit: str,
    max_posts: int = 1000,
    sort: str = "new",
) -> AsyncIterator[dict]:
    """Paginate through subreddit posts, yielding each post."""
    after = None
    collected = 0
    while collected < max_posts:
        data = await fetch_subreddit_posts(subreddit, sort=sort, after=after)
        posts = data["data"]["children"]
        if not posts:
            break
        for post in posts:
            yield post["data"]
            collected += 1
            if collected >= max_posts:
                break
        after = data["data"]["after"]
        if not after:
            break
        # Be polite: 1 second between pages
        await asyncio.sleep(1)

async def main():
    posts = []
    async for post in paginate_subreddit("entrepreneur", max_posts=200):
        posts.append({
            "title": post["title"],
            "score": post["score"],
            "url": post["url"],
            "created_utc": post["created_utc"],
            "selftext": post["selftext"][:500],  # First 500 chars
        })
    print(f"Collected {len(posts)} posts")
    # Save to JSONL (better for large datasets than JSON)
    with open("reddit_posts.jsonl", "w") as f:
        for post in posts:
            f.write(json.dumps(post) + "\n")

asyncio.run(main())
```
The Pushshift alternative: Pushshift's public API was cut off from Reddit data after the 2023 changes, but the community-run Arctic Shift project (arctic-shift.github.io) maintains historical Reddit archives going back to 2005. For research needing historical data, this is far more efficient than paginating Reddit directly.
```python
import httpx

async def search_pushshift(
    query: str,
    subreddit: str | None = None,
    after: int | None = None,
    before: int | None = None,
    size: int = 100,
):
    """Search Reddit comments via Arctic Shift (Pushshift successor)."""
    url = "https://arctic-shift.photon-reddit.com/api/comments/search"
    params = {
        "q": query,
        "size": size,
        "sort": "desc",
        "sort_type": "created_utc",
    }
    if subreddit:
        params["subreddit"] = subreddit
    if after:
        params["after"] = after
    if before:
        params["before"] = before
    async with httpx.AsyncClient() as client:
        response = await client.get(url, params=params)
        response.raise_for_status()
        return response.json()["data"]
```
## Approach 3: Browser Automation for Dynamic Content
Some Reddit content (heavily moderated threads, award-heavy posts, ads) loads differently through APIs vs. a browser. For edge cases, Playwright gives you a real browser:
```python
from playwright.async_api import async_playwright
import asyncio

async def scrape_reddit_thread(url: str) -> dict:
    """Scrape a Reddit thread using a real browser."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            viewport={"width": 1280, "height": 720},
        )
        page = await context.new_page()
        # Block images and tracking/ad requests to speed things up
        await page.route("**/*.{png,jpg,jpeg,gif,webp,svg}", lambda route: route.abort())
        await page.route("**/{analytics,tracking,ads}**", lambda route: route.abort())
        await page.goto(url, wait_until="networkidle")
        # Extract the JSON data Reddit embeds in the page
        json_data = await page.evaluate("""
            () => {
                const scripts = document.querySelectorAll('script[id^="t3_"]');
                const data = {};
                scripts.forEach(s => {
                    try { Object.assign(data, JSON.parse(s.textContent)); } catch (e) {}
                });
                return data;
            }
        """)
        await browser.close()
        return json_data
```
## Rate Limiting: The Right Strategy

Getting rate-limited on Reddit usually means a temporary IP ban, anywhere from 10 minutes to a few hours. Here's how to stay under the radar:

### 1. Respect the crawl delay
```python
import asyncio
import random
from dataclasses import dataclass, field
from collections import deque
from time import monotonic

@dataclass
class RateLimiter:
    requests_per_minute: int = 30
    _timestamps: deque = field(default_factory=deque)

    async def acquire(self):
        now = monotonic()
        # Remove timestamps older than 60 seconds
        while self._timestamps and now - self._timestamps[0] > 60:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.requests_per_minute:
            # Wait until the oldest request is 60 seconds old
            sleep_time = 60 - (now - self._timestamps[0])
            if sleep_time > 0:
                await asyncio.sleep(sleep_time)
        # Add jitter to avoid pattern detection
        await asyncio.sleep(random.uniform(0.5, 1.5))
        self._timestamps.append(monotonic())
```
### 2. Rotate User-Agents (but don't overdo it)

Reddit's bot detection looks for inconsistent User-Agent rotation — changing the UA on every request is a red flag. Rotate between 3-5 realistic browser strings and stick with each for at least 10 minutes.
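A minimal sketch of that sticky-rotation pattern. The helper class and the UA strings are illustrative (not from any Reddit documentation), and the clock is injectable so the rotation window can be tested without waiting ten minutes:

```python
import random
import time

# Illustrative pool of realistic browser UA strings (keep it small: 3-5)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

class StickyUserAgent:
    """Serve one User-Agent for at least `min_seconds` before rotating."""

    def __init__(self, agents: list[str], min_seconds: float = 600.0, clock=time.monotonic):
        self._agents = agents
        self._min_seconds = min_seconds
        self._clock = clock  # injectable for testing
        self._current = random.choice(agents)
        self._since = clock()

    def get(self) -> str:
        now = self._clock()
        if now - self._since >= self._min_seconds:
            self._current = random.choice(self._agents)
            self._since = now
        return self._current
```

Call `get()` when building each request's headers; the UA only changes once the window has elapsed, so consecutive requests look like one consistent browser session.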
### 3. Handle 429s gracefully with exponential backoff
```python
import httpx
import asyncio

async def fetch_with_backoff(url: str, headers: dict, max_retries: int = 5) -> httpx.Response:
    """Fetch a URL with exponential backoff on rate limit errors."""
    # Reuse one client across retries; follow redirects (httpx doesn't by default)
    async with httpx.AsyncClient(follow_redirects=True) as client:
        for attempt in range(max_retries):
            response = await client.get(url, headers=headers, timeout=30)
            if response.status_code == 200:
                return response
            elif response.status_code == 429:
                # Honor the Retry-After header if present
                retry_after = int(response.headers.get("Retry-After", 60))
                wait_time = max(retry_after, 2 ** attempt * 10)
                print(f"Rate limited. Waiting {wait_time}s (attempt {attempt + 1}/{max_retries})")
                await asyncio.sleep(wait_time)
            elif response.status_code in (403, 503):
                wait_time = 2 ** attempt * 30
                print(f"Blocked (HTTP {response.status_code}). Waiting {wait_time}s")
                await asyncio.sleep(wait_time)
            else:
                response.raise_for_status()
    raise RuntimeError(f"Failed after {max_retries} attempts")
```
## Common Pitfalls and How to Avoid Them

### Pitfall 1: Scraping old.reddit.com vs. new.reddit.com
The old Reddit interface (old.reddit.com) has simpler HTML and is less JavaScript-heavy. The JSON trick works on both, but if you're doing HTML scraping, old Reddit is more reliable.
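If you do go the HTML route, here is a standard-library-only sketch of parsing an old Reddit listing. The `a.title` anchor class reflects old Reddit's long-stable markup, but verify it against a live page before relying on it:

```python
from html.parser import HTMLParser

class OldRedditListing(HTMLParser):
    """Collect post titles/links from old Reddit's listing markup.

    Assumes posts are linked via `<a class="title ..." href="...">`;
    this selector is an assumption to check against the live site.
    """

    def __init__(self):
        super().__init__()
        self.posts = []
        self._in_title = False
        self._href = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and "title" in (a.get("class") or "").split():
            self._in_title = True
            self._href = a.get("href")

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.posts.append({"title": data.strip(), "url": self._href})

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False

if __name__ == "__main__":
    import httpx
    resp = httpx.get(
        "https://old.reddit.com/r/Python/new/",
        headers={"User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)"},
        follow_redirects=True,
    )
    parser = OldRedditListing()
    parser.feed(resp.text)
    for post in parser.posts[:5]:
        print(post["title"])
```

For anything heavier than a quick extraction, a real HTML library (BeautifulSoup, selectolax) is the better tool; the point here is that old Reddit's markup is simple enough that even `html.parser` can handle it.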
### Pitfall 2: The replace_more trap in PRAW

When you call `submission.comments.replace_more(limit=None)`, PRAW makes a separate API request for each "MoreComments" object. On a viral thread with 5,000 comments, this can mean hundreds of API calls. Use `limit=10` or `limit=20` for most use cases, and only `limit=None` when you genuinely need every comment.
### Pitfall 3: Storing raw Reddit data
Reddit's API ToS limits how long you can store certain data types. User account data in particular has deletion propagation requirements — if a user deletes their account, you must delete their data too (if storing beyond 90 days). Design your pipeline with this in mind.
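One way to bake that in is a periodic retention job. This sketch assumes a SQLite `posts` table with `author` and `created_utc` columns (like the pipeline at the end of this guide) and treats rows stored with the `[deleted]` author marker as deletion signals — the function name and policy are illustrative, not a compliance recipe:

```python
import sqlite3

def prune(conn: sqlite3.Connection, max_age_days: int = 90) -> int:
    """Drop rows past the retention window or authored by deleted accounts.

    Returns the number of rows removed. Note the CAST: without it, SQLite
    would compare a numeric created_utc against strftime's text output.
    """
    cur = conn.execute(
        """
        DELETE FROM posts
        WHERE author = '[deleted]'
           OR created_utc < CAST(strftime('%s', 'now', ?) AS INTEGER)
        """,
        (f"-{max_age_days} days",),
    )
    conn.commit()
    return cur.rowcount
```

In practice you would also re-check stored authors against the live API, since a row scraped before an account deletion still shows the original username.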
### Pitfall 4: Ignoring flairs and crosspost data

Many developers miss that the `crosspost_parent_list` field can cascade indefinitely. If you're recursively following crossposts, add a depth limit.
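A depth-limited walk over already-fetched post dicts illustrates the fix. `walk_crossposts` is a hypothetical helper, though `crosspost_parent_list` is the real field name in Reddit's post JSON:

```python
def walk_crossposts(post: dict, max_depth: int = 3, _depth: int = 0) -> list:
    """Collect post IDs down a crosspost chain, stopping at max_depth."""
    ids = [post.get("id")]
    if _depth >= max_depth:
        return ids  # refuse to recurse any deeper
    for parent in post.get("crosspost_parent_list", []):
        ids.extend(walk_crossposts(parent, max_depth, _depth + 1))
    return ids
```

On a five-deep chain with `max_depth=3`, the walk stops after four posts instead of chasing the chain to its end.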
## Approach 4: Using a Dedicated Scraping Service
For production workloads requiring thousands of posts per day, maintaining your own scraper is expensive. You need to handle IP rotation, CAPTCHA solving, browser fingerprinting, and Reddit's evolving bot detection.
This is where dedicated Reddit scraping APIs earn their keep. Services like the Reddit Comment Scraper on Apify handle the infrastructure for you — you get clean structured data without maintaining proxies or dealing with rate limits yourself.
The Apify platform's pay-per-result model means you only pay for actual data delivered:
```python
import httpx
import time

def scrape_reddit_via_apify(subreddit: str, keyword: str, max_posts: int = 500) -> list:
    """Use Apify's Reddit scraper actor for large-scale collection."""
    # Example integration pattern with Apify actors
    api_token = "YOUR_APIFY_TOKEN"
    # In API paths, Apify separates username and actor name with "~"
    actor_id = "cryptosignals~reddit-comment-scraper"
    headers = {"Authorization": f"Bearer {api_token}"}

    # Start a run (input fields depend on the specific actor)
    run_response = httpx.post(
        f"https://api.apify.com/v2/acts/{actor_id}/runs",
        headers=headers,
        json={
            "subreddit": subreddit,
            "keyword": keyword,
            "maxPosts": max_posts,
        },
    )
    run = run_response.json()["data"]
    run_id = run["id"]
    dataset_id = run["defaultDatasetId"]

    # Wait for completion (in production, use webhooks instead of polling)
    while True:
        status = httpx.get(
            f"https://api.apify.com/v2/acts/{actor_id}/runs/{run_id}",
            headers=headers,
        ).json()["data"]["status"]
        if status in ("SUCCEEDED", "FAILED", "ABORTED"):
            break
        time.sleep(5)

    # Fetch the run's default dataset
    return httpx.get(
        f"https://api.apify.com/v2/datasets/{dataset_id}/items",
        headers=headers,
    ).json()
```
When to use a service vs. DIY:
- DIY: < 10,000 posts/day, one-time research, academic use
- Service: Production pipelines, > 50,000 posts/day, teams without scraping expertise, data that needs to be current
## Putting It Together: A Complete Keyword Monitor

Here's a production-ready script that monitors subreddits with PRAW, deduplicates results into SQLite, and prints matches (swap the print for whatever alerting you use):
```python
import praw
import sqlite3
from dataclasses import dataclass

@dataclass
class RedditPost:
    id: str
    subreddit: str
    title: str
    selftext: str
    score: int
    url: str
    permalink: str
    author: str
    created_utc: float
    num_comments: int
    keyword_matched: str

def init_db(db_path: str = "reddit_monitor.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS posts (
            id TEXT PRIMARY KEY,
            subreddit TEXT,
            title TEXT,
            selftext TEXT,
            score INTEGER,
            url TEXT,
            permalink TEXT,
            author TEXT,
            created_utc REAL,
            num_comments INTEGER,
            keyword_matched TEXT,
            scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    return conn

def save_post(conn: sqlite3.Connection, post: RedditPost):
    conn.execute(
        "INSERT OR IGNORE INTO posts VALUES (?,?,?,?,?,?,?,?,?,?,?,CURRENT_TIMESTAMP)",
        (post.id, post.subreddit, post.title, post.selftext, post.score,
         post.url, post.permalink, post.author, post.created_utc,
         post.num_comments, post.keyword_matched),
    )
    conn.commit()

def monitor(subreddits: list[str], keywords: list[str], reddit: praw.Reddit):
    conn = init_db()
    for subreddit_name in subreddits:
        subreddit = reddit.subreddit(subreddit_name)
        for post in subreddit.new(limit=100):
            text = f"{post.title} {post.selftext}".lower()
            for keyword in keywords:
                if keyword.lower() in text:
                    p = RedditPost(
                        id=post.id,
                        subreddit=subreddit_name,
                        title=post.title,
                        selftext=post.selftext[:1000],
                        score=post.score,
                        url=post.url,
                        permalink=f"https://reddit.com{post.permalink}",
                        author=str(post.author),
                        created_utc=post.created_utc,
                        num_comments=post.num_comments,
                        keyword_matched=keyword,
                    )
                    save_post(conn, p)
                    print(f"[{keyword}] {post.title[:80]}")
                    break  # Don't double-count if multiple keywords match
    conn.close()

# Run every hour via cron:
# 0 * * * * python3 /path/to/monitor.py
if __name__ == "__main__":
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="keyword-monitor/1.0",
    )
    monitor(
        subreddits=["entrepreneur", "startups", "SaaS", "Python"],
        keywords=["scraping", "data collection", "API alternative"],
        reddit=reddit,
    )
```
## Summary: Which Approach for Which Situation
| Use Case | Recommended Approach | Why |
|---|---|---|
| Keyword monitoring, ToS-compliant | PRAW + official API | Reliable, structured, handles rate limits |
| One-time research, moderate scale | JSON trick + asyncio | No auth needed, fast |
| Historical data (pre-2023) | Arctic Shift / Pushshift | Much faster than paginating live |
| Production pipeline, high volume | Apify or similar service | Infrastructure handled, no IP bans |
| Edge cases, dynamic content | Playwright | Full browser rendering |
Reddit scraping in 2026 is viable when done thoughtfully. Respect rate limits, handle errors gracefully, and choose the right tool for your scale. Scrapers that try to brute-force the site end up blocked within hours; those that treat Reddit's infrastructure with respect can run indefinitely.