Reddit killed its free API in July 2023. What used to be a simple praw call now requires OAuth approval that takes weeks, rate limits that make bulk collection useless, and pricing that starts at $0.24 per 1,000 API calls.
But Reddit's data is still public. And there are still ways to collect it — legally, reliably, and at scale. Here's what actually works in 2026.
Method 1: Reddit's Hidden JSON Endpoints
This is the best-kept secret in web scraping. Reddit serves JSON for every single page. Just append .json to any URL:
https://www.reddit.com/r/technology/top.json?t=week&limit=25
No API key. No OAuth. No approval process. Just raw JSON.
Here's a working Python example:
import requests
import time

def scrape_subreddit(subreddit, sort="hot", limit=25):
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 429:
        print("Rate limited. Waiting 60s...")
        time.sleep(60)
        return scrape_subreddit(subreddit, sort, limit)
    if response.status_code != 200:
        raise Exception(f"HTTP {response.status_code}")
    data = response.json()
    posts = []
    for child in data["data"]["children"]:
        post = child["data"]
        posts.append({
            "title": post["title"],
            "score": post["score"],
            "url": post["url"],
            "author": post["author"],
            "created_utc": post["created_utc"],
            "num_comments": post["num_comments"],
            "selftext": post.get("selftext", ""),
            "permalink": f"https://reddit.com{post['permalink']}"
        })
    return posts, data["data"].get("after")  # 'after' token for pagination

# Fetch top posts from r/technology
posts, after_token = scrape_subreddit("technology", sort="top")
for p in posts[:5]:
    print(f"[{p['score']}] {p['title']}")
Pagination works with the after parameter:
def scrape_all_pages(subreddit, sort="top", max_pages=5):
    all_posts = []
    after = None
    for page in range(max_pages):
        url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit=100"
        if after:
            url += f"&after={after}"
        headers = {"User-Agent": "DataCollector/2.0 (research project)"}
        resp = requests.get(url, headers=headers)
        data = resp.json()
        children = data["data"]["children"]
        if not children:
            break
        all_posts.extend([c["data"] for c in children])
        after = data["data"].get("after")
        if not after:
            break
        time.sleep(2)  # Be respectful
    return all_posts
Limitations: Reddit rate-limits these endpoints aggressively. You'll get 429 errors after ~60 requests per minute from a single IP. For casual scraping, this is fine. For anything bigger, you need Method 2.
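Until you actually need proxies, a small client-side throttle keeps you safely under that threshold. A minimal sketch — the ~60 requests/minute figure is approximate, so the 2-second interval is an assumption you may need to tune:

```python
import time

class Throttle:
    """Enforce a minimum delay between successive requests."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last_call = 0.0

    def wait(self):
        # Sleep just long enough to keep at least min_interval
        # seconds between calls; no-op if enough time has passed.
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()

# Usage: call throttle.wait() before every requests.get()
throttle = Throttle(min_interval=2.0)
```

Calling `throttle.wait()` before each fetch keeps your request rate deterministic regardless of how fast your parsing loop runs.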
Method 2: Proxy Rotation for Scale
The JSON endpoint works — until Reddit recognizes your IP. The fix is rotating residential proxies.
ScraperAPI handles this automatically: proxy rotation, CAPTCHA solving, and retry logic in a single API call.
import requests

SCRAPER_API_KEY = "your_key_here"

def scrape_with_proxy(url):
    payload = {
        "api_key": SCRAPER_API_KEY,
        "url": url,
        "render": "false"
    }
    resp = requests.get("https://api.scraperapi.com", params=payload)
    return resp.json()

# Scrape without worrying about blocks
data = scrape_with_proxy(
    "https://www.reddit.com/r/technology/top.json?t=month&limit=100"
)
print(f"Got {len(data['data']['children'])} posts")
With ScraperAPI, you get:
- 40M+ residential IPs, which makes IP-level blocking impractical
- Automatic retries on failures
- Geotargeting if you need location-specific results
- Free tier with 5,000 API credits to test
This is the move when you need 1,000+ posts or are scraping continuously.
Method 3: Pre-Built Scrapers (Zero Code)
If you don't want to write code at all, Apify's Reddit Scraper handles everything — pagination, rate limits, proxy rotation, structured output.
You configure it with a subreddit URL, set the number of posts, and it exports clean JSON or CSV. It's useful for one-off data collection, market research, or feeding data into an analysis pipeline.
You can also call it programmatically:
from apify_client import ApifyClient

client = ApifyClient("your_apify_token")

run = client.actor("cryptosignals/reddit-scraper").call(
    run_input={
        "startUrls": [{"url": "https://www.reddit.com/r/technology/"}],
        "maxItems": 500,
        "sort": "top",
        "time": "month"
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["score"])
Complete Example: Monitor r/technology Daily
Here's a production-ready script that scrapes daily, deduplicates, and saves to CSV:
import requests
import csv
import time
import os
from datetime import datetime

SUBREDDIT = "technology"
OUTPUT_FILE = "reddit_technology.csv"
SEEN_IDS_FILE = "seen_ids.txt"

def load_seen_ids():
    if os.path.exists(SEEN_IDS_FILE):
        with open(SEEN_IDS_FILE) as f:
            return set(f.read().splitlines())
    return set()

def save_seen_ids(ids):
    with open(SEEN_IDS_FILE, "w") as f:
        f.write("\n".join(ids))

def scrape_top_posts(subreddit, time_filter="day", limit=100):
    url = f"https://www.reddit.com/r/{subreddit}/top.json?t={time_filter}&limit={limit}"
    headers = {
        "User-Agent": "TopPostTracker/1.0 (monitoring project)"
    }
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    return [
        {
            "id": c["data"]["id"],
            "title": c["data"]["title"],
            "score": c["data"]["score"],
            "author": c["data"]["author"],
            "url": c["data"]["url"],
            "comments": c["data"]["num_comments"],
            "created": datetime.utcfromtimestamp(
                c["data"]["created_utc"]
            ).isoformat(),
            "scraped_at": datetime.utcnow().isoformat()
        }
        for c in resp.json()["data"]["children"]
    ]

def main():
    seen = load_seen_ids()
    posts = scrape_top_posts(SUBREDDIT)
    new_posts = [p for p in posts if p["id"] not in seen]
    if not new_posts:
        print("No new posts found.")
        return
    file_exists = os.path.exists(OUTPUT_FILE)
    with open(OUTPUT_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=new_posts[0].keys())
        if not file_exists:
            writer.writeheader()
        writer.writerows(new_posts)
    seen.update(p["id"] for p in new_posts)
    save_seen_ids(seen)
    print(f"Saved {len(new_posts)} new posts ({len(posts) - len(new_posts)} duplicates skipped)")

if __name__ == "__main__":
    main()
Run this with cron once a day and you've got a free Reddit monitoring pipeline.
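Once the CSV has accumulated a few days of data, summarizing it takes only a few lines. A minimal sketch, assuming the column names the script above writes (`title`, `score`):

```python
import csv

def top_posts(csv_path, n=5):
    """Return the n highest-scoring rows from the monitoring CSV."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Scores are stored as strings in the CSV, so cast before sorting.
    rows.sort(key=lambda r: int(r["score"]), reverse=True)
    return rows[:n]

# Example (after the monitor has run at least once):
# for post in top_posts("reddit_technology.csv"):
#     print(f"[{post['score']}] {post['title']}")
```

The same pattern extends to weekly digests or feeding the rows into pandas for proper analysis.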
Anti-Bot Tips
Reddit's anti-scraping has gotten smarter. Here's how to avoid detection:
1. Rotate User-Agents
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101",
]

headers = {"User-Agent": random.choice(USER_AGENTS)}
2. Rate limit yourself — 1 request every 2 seconds minimum. Reddit tracks request patterns.
3. Respect 429s — Back off exponentially:
def request_with_backoff(url, headers, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers)
        if resp.status_code == 200:
            return resp
        if resp.status_code == 429:
            wait = 2 ** attempt * 10  # 10s, 20s, 40s, 80s, 160s
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
        else:
            resp.raise_for_status()
    raise Exception("Max retries exceeded")
4. Use sessions — requests.Session() reuses TCP connections and looks more like a real browser.
5. Don't scrape logged-in pages — Stick to public endpoints. Scraping behind auth violates Reddit's TOS.
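Tips 1, 3, and 4 combine naturally into a single configured session. Here's a sketch using urllib3's built-in `Retry` (which requests sessions support via `HTTPAdapter`, and which honors `Retry-After` headers on 429s); the User-Agent list is illustrative:

```python
import random
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]

def make_session():
    """Build a Session with a randomized UA and backoff on 429/5xx."""
    session = requests.Session()
    session.headers["User-Agent"] = random.choice(USER_AGENTS)
    retry = Retry(
        total=5,
        backoff_factor=10,  # sleeps grow roughly as 10s, 20s, 40s, ...
        status_forcelist=[429, 500, 502, 503],
        respect_retry_after_header=True,
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

session = make_session()
# session.get(...) now retries and backs off automatically
```

Pushing retry logic into the transport layer keeps your scraping functions clean: they just call `session.get()` and handle the final response.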
When to Use Each Method
| Method | Best For | Cost | Scale |
|---|---|---|---|
| JSON endpoints | Side projects, research, <1K posts | Free | Low |
| ScraperAPI + proxies | Production pipelines, daily collection | ~$49/mo | High |
| Apify pre-built | One-off exports, non-developers | Pay per use | Medium |
My recommendation: Start with Method 1. It's free and handles most use cases. When you hit rate limits consistently, add ScraperAPI for proxy rotation. Only go to Apify if you need a managed solution.
Key Takeaways
- Reddit's .json endpoints are still the easiest way to get structured data
- Always rotate User-Agents and respect rate limits
- For scale, proxy rotation is non-negotiable
- Save yourself time — deduplicate with post IDs, not URLs
- Stick to public data. Don't scrape anything that requires login
The code in this article is tested and working as of March 2026. Reddit changes things periodically, so if something breaks, check the response format first — the field names occasionally shift.
Building a scraping pipeline? I write about Python automation, web scraping, and developer tools. Follow for more practical guides.