Reddit used to be the Wild West of data scraping. Dozens of third-party apps pulled millions of posts per day without friction. Then 2023 happened.
In June 2023, Reddit announced API pricing of $0.24 per 1,000 API calls for commercial use. Overnight, third-party clients like Apollo, Infinity, and Reddit is Fun (RIF) shut down. The community raged, thousands of subreddits went dark in protest. Chaos.
But here's what most people got wrong: Reddit data isn't actually locked down. It's just more intentional now. You can still scrape Reddit legally and effectively — you just need to know what actually works in 2026.
This guide covers three methods with working code examples, legal context you actually need to understand, and honest assessment of when each approach makes sense.
The Reddit API Landscape in 2026
Before we dive into methods, let's clarify what Reddit is actually protecting:
Reddit's Official API is still free for non-commercial, non-competitive use. The $0.24/1,000 calls pricing applies to commercial services that extract and republish Reddit data. If you're building tools for yourself, a research project, or offering services that add value (not just republishing), you're in the clear.
Rate limits are enforced per OAuth client and are generous for legitimate use: roughly 60 requests per minute.
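If you're making raw HTTP calls yourself rather than going through PRAW (which throttles for you), a small client-side limiter keeps you under whatever per-minute budget you're targeting. A minimal sliding-window sketch — the 60-per-minute default is just a placeholder, not a Reddit-documented value:

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window throttle: allow at most `max_calls` per `period` seconds."""

    def __init__(self, max_calls=60, period=60.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()  # timestamps of recent calls

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self.calls and now - self.calls[0] > self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call leaves the window
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())

# Usage: call limiter.wait() before every request you send
limiter = RateLimiter(max_calls=60, period=60.0)
```

Call `wait()` immediately before each `requests.get(...)`; it returns instantly while you're under budget and blocks only when you'd exceed it.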
The catch: Reddit aggressively monitors for scrapers and will block you if you look suspicious. Every method below includes mitigations.
Method 1: Reddit's Official API with PRAW (Easiest Path)
If you're building anything beyond one-off data collection, this is your starting point.
PRAW (Python Reddit API Wrapper) is the maintained, endorsed way to access Reddit data. It handles authentication, rate limiting, and error handling for you.
Setup
```bash
pip install praw
```
Code Example: Scrape Recent Posts from a Subreddit
```python
import praw
from datetime import datetime, timezone

# Authentication
reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    user_agent='DataCollector/1.0 (by u/YourRedditUsername)'
)

# Scrape posts from r/Python
subreddit = reddit.subreddit('Python')

posts = []
for submission in subreddit.new(limit=100):
    posts.append({
        'title': submission.title,
        'score': submission.score,
        'author': submission.author.name if submission.author else '[deleted]',
        # created_utc is epoch seconds in UTC, so attach the UTC timezone
        'timestamp': datetime.fromtimestamp(submission.created_utc, tz=timezone.utc),
        'url': submission.url,
        'text': submission.selftext,
        'upvote_ratio': submission.upvote_ratio,
        'num_comments': submission.num_comments
    })

for post in posts:
    print(f"{post['title']} ({post['score']} upvotes)")
```
How to Get Credentials
- Go to https://www.reddit.com/prefs/apps
- Click "Create an application"
- Choose "script" (for personal use)
- Name it anything (e.g., "My Data Collector")
- Copy the client_id and client_secret
- Set a descriptive user_agent — Reddit's API rules recommend the format `<platform>:<app-id>:<version> (by /u/username)`
Limits & Reality Check
- Rate limit: ~60 requests/minute
- Cost: Free (forever, for non-commercial)
- Data available: Posts, comments, user profiles, subreddit metadata
- Cannot scrape: Private messages, deleted content, or historical vote counts (only a post's current score is visible)
- Speed: Listing endpoints return at most ~1,000 items each; PRAW sleeps automatically as you approach the rate limit, so big pulls are slow but steady
When to use this: Monitoring specific subreddits, building tools for Reddit users, research with <50K posts.
Method 2: Historical Data with Arctic Shift & Academic Torrents
If you need historical Reddit data (posts from 2015, 2018, etc.), the official API won't help you.
Pushshift used to be the standard here: a massive archive of hundreds of millions of Reddit submissions and comments stretching back to the site's early years. Then Reddit cut off its API access in May 2023 for violating the Data API terms, and free public access never came back.
The successor is Arctic Shift — an academic project hosting Reddit data dumps via Academic Torrents.
How It Works
Arctic Shift publishes monthly archives of Reddit data (posts, comments, metadata) as torrent files. You download entire months, then query locally or load into a database.
Setup
- Download a torrent client (qBittorrent, Transmission)
- Visit: https://academictorrents.com/details/56aa49f5665710803c11137e53931c63ecd12126
- Choose the month/year you need
- Download the torrent (100GB–500GB per year, compressed)
- Extract and load into SQLite, Postgres, or analyze with pandas
Code Example: Query Local Arctic Shift Data
Assuming you've downloaded and extracted data into /data/reddit_posts.csv:
```python
import pandas as pd
import sqlite3

# Load CSV (if using a CSV export from the torrent)
df = pd.read_csv('/data/reddit_posts_2023.csv')

# Filter by subreddit and date range (created_utc is epoch seconds)
python_posts = df[
    (df['subreddit'] == 'Python') &
    (df['created_utc'] >= 1672531200) &  # Jan 1, 2023 00:00:00 UTC
    (df['created_utc'] <= 1704067199)    # Dec 31, 2023 23:59:59 UTC
]

print(f"Found {len(python_posts)} posts in r/Python during 2023")

# Or load into SQLite for larger datasets
conn = sqlite3.connect('/data/reddit.db')
df.to_sql('posts', conn, if_exists='append', index=False)

# Query with SQL
query = """
SELECT subreddit, COUNT(*) AS post_count, AVG(score) AS avg_score
FROM posts
WHERE created_utc >= ? AND created_utc <= ?
GROUP BY subreddit
ORDER BY post_count DESC
LIMIT 10
"""
result = pd.read_sql_query(query, conn, params=(1672531200, 1704067199))
print(result)
```
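The raw dumps themselves are newline-delimited JSON, one object per line (typically zstd-compressed). If you'd rather filter while streaming than load a whole month into pandas, the per-line logic looks like this — a sketch that assumes the dump is already decompressed and that records carry `subreddit` and `created_utc` fields, as Pushshift-style dumps do:

```python
import json

def filter_ndjson_lines(lines, subreddit, start_utc, end_utc):
    """Yield parsed records for one subreddit within an epoch-second window.

    `lines` is any iterable of NDJSON strings -- a file object, or the decoded
    output of a zstd stream reader if the dump is still compressed.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated/corrupt lines rather than abort a huge pass
        if record.get('subreddit') != subreddit:
            continue
        created = record.get('created_utc')
        if created is None or not (start_utc <= int(created) <= end_utc):
            continue
        yield record

# Usage with an uncompressed dump file (path is illustrative):
# with open('/data/RS_2023-01', encoding='utf-8') as f:
#     for post in filter_ndjson_lines(f, 'Python', 1672531200, 1704067199):
#         print(post['title'])
```

Because it's a generator over an iterable of lines, memory stays flat no matter how large the dump is.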
Limits & Reality Check
- Data freshness: Dumps trail live Reddit by roughly one to two months
- Cost: Free (via torrent)
- Scope: Historical only; good for research and sentiment analysis
- Size: Manageable with modern hardware, but a full year of posts and comments can run into the hundreds of GB compressed
- Legal: Academic use is protected; commercial use is grayer (see Legal section below)
When to use this: Historical analysis, training ML models, trend research, comparative studies across years.
Method 3: Web Scraping (old.reddit.com)
If you need fresh data and can tolerate slower collection, you can scrape the old Reddit UI directly.
Modern Reddit (reddit.com) is JavaScript-heavy and anti-scraper. But old.reddit.com is static HTML — much simpler to scrape.
Why This Works
- old.reddit.com returns plain, server-rendered HTML without heavy JavaScript
- Rate limiting is less aggressive than the API's (from the scraper's perspective)
- You control the timing, so you can be polite
- Works for posts, comments, and user profiles
Code Example: Scrape Posts with Beautiful Soup
```python
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin

# Browser-like headers to avoid being blocked outright
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

def scrape_old_reddit(subreddit, limit=100):
    """Scrape posts from old.reddit.com"""
    posts = []
    url = f'https://old.reddit.com/r/{subreddit}/'

    while len(posts) < limit:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        # Each post in the listing is a div.thing
        for post_elem in soup.find_all('div', class_='thing'):
            classes = post_elem.get('class', [])
            # Skip stickied posts and promoted (ad) listings
            if 'stickied' in classes or 'promoted' in classes:
                continue

            title_elem = post_elem.find('a', class_='title')
            score_elem = post_elem.find('div', class_='score')
            author_elem = post_elem.find('a', class_='author')

            if not title_elem:
                continue

            posts.append({
                'title': title_elem.get_text(strip=True),
                'url': urljoin('https://old.reddit.com', title_elem.get('href', '')),
                'score': score_elem.get_text(strip=True) if score_elem else 'N/A',
                'author': author_elem.get_text(strip=True) if author_elem else '[deleted]',
                'subreddit': subreddit
            })

        # Follow the "next" link for pagination
        next_button = soup.find('a', rel='next')
        if not next_button or len(posts) >= limit:
            break
        url = urljoin('https://old.reddit.com', next_button.get('href', ''))
        time.sleep(2)  # Be polite: 2-second delay between requests

    return posts[:limit]

# Example usage
posts = scrape_old_reddit('Python', limit=50)
for post in posts:
    print(f"{post['title']} by {post['author']}")
```
Install Dependencies
```bash
pip install requests beautifulsoup4
```
Limits & Reality Check
- Speed: 10–30 posts per minute (with 2-second delays)
- Cost: Free
- Reliability: Fragile to DOM changes; Reddit can break this with a UI update
- Rate limiting: Less aggressive than the API, but if you hammer it you'll get blocked
- Legal risk: Higher than PRAW (see Legal section)
When to use this: Small one-off scrapes, real-time monitoring of a few subreddits, prototyping before moving to official API.
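A mitigation worth baking in regardless of method: when you do get throttled, back off exponentially instead of retrying immediately. A minimal retry wrapper — the status codes and delays are sensible defaults, not values documented by Reddit:

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=2.0,
                       retry_statuses=(429, 500, 502, 503)):
    """Call `fetch()` (anything returning an object with .status_code),
    retrying on transient statuses with exponential backoff: 2s, 4s, 8s, ...
    Returns the last response either way; the caller decides what to do
    with a final failure."""
    response = None
    for attempt in range(max_retries):
        response = fetch()
        if response.status_code not in retry_statuses:
            return response
        time.sleep(base_delay * (2 ** attempt))
    return response
```

Usage is just `fetch_with_backoff(lambda: requests.get(url, headers=headers, timeout=10))` — the lambda keeps the wrapper independent of any particular HTTP library.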
Comparison Table
| Method | Speed | Cost | Freshness | Legal Risk | Best For |
|---|---|---|---|---|---|
| PRAW (Official API) | 60 req/min | Free | Real-time | Lowest | Production, research, monitoring |
| Arctic Shift | N/A (bulk) | Free | 1-2 months lag | Medium | Historical analysis, ML training |
| Web Scraping (old.reddit.com) | 10–30/min | Free | Real-time | Higher | Prototyping, small datasets |
Legal Considerations: What You Actually Need to Know
Reddit's terms of service forbid automated access outside the official API. But a TOS violation is a breach of contract, not a crime. Here's what actually matters:
Reddit's Terms of Service
Violating Reddit's TOS can get your account or IP banned, but on its own it's a civil matter, not a criminal one. For persistent or commercial offenders, Reddit may escalate to cease-and-desist letters or litigation.
What's forbidden explicitly: Automated access without permission, except through the official API.
Exception: Reddit's published Data API terms carve out free access for non-commercial use, and Reddit maintains guidelines for academic researchers.
The Computer Fraud and Abuse Act (CFAA)
The CFAA makes unauthorized computer access illegal. But courts have found that accessing publicly available data is generally not "unauthorized access" in the CFAA sense. The landmark case is hiQ Labs v. LinkedIn (filed 2017).
The ruling: the Ninth Circuit held in 2019 — and reaffirmed in 2022 after the Supreme Court's Van Buren decision — that scraping publicly available data likely does not violate the CFAA, even if the site's TOS forbids it. The caveat: the case ultimately settled, and hiQ was found to have breached LinkedIn's user agreement, so contract claims remain a real exposure. The same logic applies to Reddit's public pages.
But there's nuance:
- If your scraping causes damage (denial of service, excessive server load), CFAA or other claims come back into play.
- State computer-crime laws vary, and CFAA interpretation continues to evolve.
- If you're republishing Reddit data as-is, you may infringe copyright (Reddit users own their posts' copyrights).
The Safe Path
- Use the official API when possible. It's endorsed and safest.
- Be polite when scraping. Add delays, respect robots.txt patterns, identify yourself in User-Agent.
- Don't republish. Summarize, analyze, or add value. Don't just copy Reddit posts into your own site.
- Document your purpose. "Research" or "personal project" is fine. "Competitive scraping" is not.
- Respect account ToS. Use non-commercial accounts. Don't claim commercial status if you're not.
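On the robots.txt point: Python's standard library can answer "may this User-Agent fetch this path?" once you have the rules in hand. A sketch using `urllib.robotparser` — the rules below are illustrative, not Reddit's actual robots.txt; in practice, point the parser at the live file with `rp.set_url('https://old.reddit.com/robots.txt')` followed by `rp.read()`:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules -- fetch the site's real robots.txt in practice
rules = """
User-agent: *
Disallow: /login
Allow: /r/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch('DataCollector/1.0', 'https://old.reddit.com/r/Python/'))  # True
print(rp.can_fetch('DataCollector/1.0', 'https://old.reddit.com/login'))      # False
```

Check `can_fetch` before each new path prefix you crawl; it's a cheap, local lookup once the rules are parsed.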
For production scale, services like ScraperAPI handle rate limiting and IP rotation for you. Read their terms to see how compliance responsibility is allocated — offloading that overhead can be worth the cost if you're operating commercially.
Real-World Example: Building a Reddit Sentiment Dashboard
Let's say you want to monitor r/stocks, r/investing, and r/crypto for real-time sentiment. Here's the hybrid approach:
```python
import praw
import time
from datetime import datetime, timedelta, timezone

reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    user_agent='SentimentMonitor/1.0'
)

def monitor_subreddits(subreddits, check_interval_minutes=30):
    """Monitor multiple subreddits for new-post activity."""
    # Compare in UTC throughout: created_utc is epoch seconds in UTC
    last_check = datetime.now(timezone.utc) - timedelta(hours=1)

    while True:
        for subreddit_name in subreddits:
            subreddit = reddit.subreddit(subreddit_name)

            # Collect posts created since the last check
            new_posts = []
            for submission in subreddit.new(limit=50):
                created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
                if created > last_check:
                    new_posts.append({
                        'title': submission.title,
                        'score': submission.score,
                        'num_comments': submission.num_comments,
                        'created': created,
                        'url': submission.url
                    })

            if new_posts:
                avg_score = sum(p['score'] for p in new_posts) / len(new_posts)
                print(f"\nr/{subreddit_name}: {len(new_posts)} new posts, "
                      f"avg score: {avg_score:.1f}")

                # Top post by score
                top = max(new_posts, key=lambda p: p['score'])
                print(f"  Top: {top['title'][:60]}... ({top['score']} points)")

        last_check = datetime.now(timezone.utc)
        print(f"\n[{datetime.now().strftime('%H:%M:%S')}] "
              f"Sleeping {check_interval_minutes} minutes...")
        time.sleep(check_interval_minutes * 60)

# Run it
monitor_subreddits(['stocks', 'investing', 'crypto'])
```
This approach stays within API limits, requires no scraping, and runs indefinitely.
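Note that the monitor above tracks activity, not sentiment. A deliberately crude keyword scorer shows where a real model (VADER, a fine-tuned transformer) would slot in — the word lists are tiny illustrative placeholders, not a serious lexicon:

```python
POSITIVE = {'bullish', 'moon', 'buy', 'gain', 'undervalued', 'up'}
NEGATIVE = {'bearish', 'crash', 'sell', 'loss', 'overvalued', 'down'}

def keyword_sentiment(text):
    """Return a score in [-1, 1]: +1 if only positive words match,
    -1 if only negative words match, 0 if neither appears."""
    words = [w.strip('.,!?').lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)

print(keyword_sentiment('Feeling bullish, time to buy'))    # 1.0
print(keyword_sentiment('Market crash incoming, sell now')) # -1.0
```

To wire it in, score each post's `title` (or `title` + `selftext`) as you collect it and average per subreddit alongside the score stats.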
What Changed Since 2023?
- Reddit killed Pushshift (2023): No more free archival API. Arctic Shift filled the gap for academics.
- API pricing stabilized: No price increases since the 2023 announcement. If you're non-commercial, it's still free.
- Third-party apps went extinct: But that was about user-facing clients, not data extraction. Bots and data scrapers adapted.
- Detection improved: Reddit now has better bot detection. Using PRAW with a legitimate account is safer than raw HTTP requests.
Resources
- Official Reddit API Docs: https://www.reddit.com/dev/api/
- PRAW Documentation: https://praw.readthedocs.io/
- Arctic Shift Torrents: https://academictorrents.com/details/56aa49f5665710803c11137e53931c63ecd12126
TL;DR
- Use PRAW for anything ongoing or production. It's free, fast, and legal.
- Use Arctic Shift if you need historical data. Free via academic torrents and comprehensive, but the downloads are bulky and lag live Reddit.
- Scrape old.reddit.com only for small one-offs. It breaks easily and carries legal ambiguity.
- Be respectful. Add delays, use a real account, don't republish.
- For scale, use ScraperAPI to handle the details.
Reddit data is accessible in 2026. You just need to know which door to knock on.
Want to stay ahead of scraping trends? Subscribe to The Data Collector for working code, legal updates, and practical data extraction strategies. No fluff — just what works.
Disclosure: This post contains affiliate links. I may earn a commission if you sign up through my links, at no extra cost to you.
Compare web scraping APIs:
- ScraperAPI — 5,000 free credits, 50+ countries, structured data parsing
- Scrape.do — From $29/mo, strong Cloudflare bypass
- ScrapeOps — Proxy comparison + monitoring dashboard
Need custom web scraping? Email hustler@curlship.com — fast turnaround, fair pricing.