DEV Community

agenthustler
How to Scrape Reddit Without Getting Blocked: A 2026 Guide

Reddit killed its free API in July 2023. What used to be a simple praw call now requires OAuth approval that takes weeks, rate limits that make bulk collection useless, and pricing that starts at $0.24 per 1,000 API calls.

But Reddit's data is still public. And there are still ways to collect it — legally, reliably, and at scale. Here's what actually works in 2026.

Method 1: Reddit's Hidden JSON Endpoints

This is the best-kept secret in web scraping. Reddit serves JSON for every single page. Just append .json to any URL:

https://www.reddit.com/r/technology/top.json?t=week&limit=25

No API key. No OAuth. No approval process. Just raw JSON.

Here's a working Python example:

import requests
import time

def scrape_subreddit(subreddit, sort="hot", limit=25, retries=3):
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }

    response = requests.get(url, headers=headers, timeout=10)

    if response.status_code == 429:
        if retries == 0:
            raise Exception("Rate limited: max retries exceeded")
        print("Rate limited. Waiting 60s...")
        time.sleep(60)
        # Bounded retry instead of unbounded recursion
        return scrape_subreddit(subreddit, sort, limit, retries - 1)

    if response.status_code != 200:
        raise Exception(f"HTTP {response.status_code}")

    data = response.json()
    posts = []

    for child in data["data"]["children"]:
        post = child["data"]
        posts.append({
            "title": post["title"],
            "score": post["score"],
            "url": post["url"],
            "author": post["author"],
            "created_utc": post["created_utc"],
            "num_comments": post["num_comments"],
            "selftext": post.get("selftext", ""),
            "permalink": f"https://reddit.com{post['permalink']}"
        })

    return posts, data["data"].get("after")  # 'after' token for pagination

# Fetch top posts from r/technology
posts, after_token = scrape_subreddit("technology", sort="top")
for p in posts[:5]:
    print(f"[{p['score']}] {p['title']}")

Pagination works with the after parameter:

def scrape_all_pages(subreddit, sort="top", max_pages=5):
    all_posts = []
    after = None

    for page in range(max_pages):
        url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit=100"
        if after:
            url += f"&after={after}"

        headers = {"User-Agent": "DataCollector/2.0 (research project)"}
        resp = requests.get(url, headers=headers, timeout=10)
        resp.raise_for_status()
        data = resp.json()

        children = data["data"]["children"]
        if not children:
            break

        all_posts.extend([c["data"] for c in children])
        after = data["data"].get("after")

        if not after:
            break

        time.sleep(2)  # Be respectful

    return all_posts

Limitations: Reddit rate-limits these endpoints aggressively. You'll get 429 errors after ~60 requests per minute from a single IP. For casual scraping, this is fine. For anything bigger, you need Method 2.

Method 2: Proxy Rotation for Scale

The JSON endpoint works — until Reddit recognizes your IP. The fix is rotating residential proxies.

ScraperAPI handles this automatically: proxy rotation, CAPTCHA solving, and retry logic in a single API call.

import requests

SCRAPER_API_KEY = "your_key_here"

def scrape_with_proxy(url):
    payload = {
        "api_key": SCRAPER_API_KEY,
        "url": url,
        "render": "false"
    }
    resp = requests.get("https://api.scraperapi.com", params=payload)
    return resp.json()

# Scrape without worrying about blocks
data = scrape_with_proxy(
    "https://www.reddit.com/r/technology/top.json?t=month&limit=100"
)
print(f"Got {len(data['data']['children'])} posts")

With ScraperAPI, you get:

  • 40M+ residential IPs — a single blocked IP never stalls your collection
  • Automatic retries on failures
  • Geotargeting if you need location-specific results
  • Free tier with 5,000 API credits to test

This is the move when you need 1,000+ posts or are scraping continuously.
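For continuous collection, the proxy call combines naturally with the same `after` pagination token from Method 1. Here's a sketch, assuming the `api.scraperapi.com` endpoint and parameters shown above (the helper names are my own):

```python
import requests

SCRAPER_API_KEY = "your_key_here"

def build_proxy_params(subreddit, sort="top", limit=100, after=None):
    """Build ScraperAPI query parameters for one page of a subreddit listing."""
    target = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"
    if after:
        target += f"&after={after}"
    return {"api_key": SCRAPER_API_KEY, "url": target, "render": "false"}

def scrape_pages_via_proxy(subreddit, max_pages=5):
    """Walk the 'after' pagination cursor through the proxy, page by page."""
    all_posts, after = [], None
    for _ in range(max_pages):
        params = build_proxy_params(subreddit, after=after)
        data = requests.get("https://api.scraperapi.com", params=params).json()
        children = data["data"]["children"]
        if not children:
            break
        all_posts.extend(c["data"] for c in children)
        after = data["data"].get("after")
        if not after:
            break
    return all_posts
```

No `time.sleep` between pages here — the proxy layer absorbs the rate limiting, which is the whole point of Method 2.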

Method 3: Pre-Built Scrapers (Zero Code)

If you don't want to write code at all, Apify's Reddit Scraper handles everything — pagination, rate limits, proxy rotation, structured output.

You configure it with a subreddit URL, set the number of posts, and it exports clean JSON or CSV. It's useful for one-off data collection, market research, or feeding data into an analysis pipeline.

You can also call it programmatically:

from apify_client import ApifyClient

client = ApifyClient("your_apify_token")

run = client.actor("cryptosignals/reddit-scraper").call(
    run_input={
        "startUrls": [{"url": "https://www.reddit.com/r/technology/"}],
        "maxItems": 500,
        "sort": "top",
        "time": "month"
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["score"])

Complete Example: Monitor r/technology Daily

Here's a production-ready script that scrapes daily, deduplicates, and saves to CSV:

import requests
import csv
import time
import os
from datetime import datetime, timezone

SUBREDDIT = "technology"
OUTPUT_FILE = "reddit_technology.csv"
SEEN_IDS_FILE = "seen_ids.txt"

def load_seen_ids():
    if os.path.exists(SEEN_IDS_FILE):
        with open(SEEN_IDS_FILE) as f:
            return set(f.read().splitlines())
    return set()

def save_seen_ids(ids):
    with open(SEEN_IDS_FILE, "w") as f:
        f.write("\n".join(ids))

def scrape_top_posts(subreddit, time_filter="day", limit=100):
    url = f"https://www.reddit.com/r/{subreddit}/top.json?t={time_filter}&limit={limit}"
    headers = {
        "User-Agent": "TopPostTracker/1.0 (monitoring project)"
    }

    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()

    return [
        {
            "id": c["data"]["id"],
            "title": c["data"]["title"],
            "score": c["data"]["score"],
            "author": c["data"]["author"],
            "url": c["data"]["url"],
            "comments": c["data"]["num_comments"],
            "created": datetime.fromtimestamp(
                c["data"]["created_utc"], tz=timezone.utc
            ).isoformat(),
            "scraped_at": datetime.now(timezone.utc).isoformat()
        }
        for c in resp.json()["data"]["children"]
    ]

def main():
    seen = load_seen_ids()
    posts = scrape_top_posts(SUBREDDIT)

    new_posts = [p for p in posts if p["id"] not in seen]

    if not new_posts:
        print("No new posts found.")
        return

    file_exists = os.path.exists(OUTPUT_FILE)
    with open(OUTPUT_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=new_posts[0].keys())
        if not file_exists:
            writer.writeheader()
        writer.writerows(new_posts)

    seen.update(p["id"] for p in new_posts)
    save_seen_ids(seen)

    print(f"Saved {len(new_posts)} new posts ({len(posts) - len(new_posts)} duplicates skipped)")

if __name__ == "__main__":
    main()

Run this with cron once a day and you've got a free Reddit monitoring pipeline.
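For reference, a daily 8 AM crontab entry might look like this (the paths and script name are placeholders; adjust them to your setup):

```shell
# m h dom mon dow  command
0 8 * * * /usr/bin/python3 /home/you/reddit_monitor.py >> /home/you/reddit_monitor.log 2>&1
```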

Anti-Bot Tips

Reddit's anti-scraping has gotten smarter. Here's how to avoid detection:

1. Rotate User-Agents

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
]

headers = {"User-Agent": random.choice(USER_AGENTS)}

2. Rate limit yourself — send at most one request every 2 seconds. Reddit tracks request patterns.
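One way to enforce that floor is a tiny throttle helper wrapped around your request loop (a sketch; the class name is mine):

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive requests."""
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval, if any
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Call `throttle.wait()` before each `requests.get` and the 2-second floor holds no matter how fast your parsing loop runs.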

3. Respect 429s — Back off exponentially:

def request_with_backoff(url, headers, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 200:
            return resp
        if resp.status_code == 429:
            wait = 2 ** attempt * 10  # 10s, 20s, 40s, 80s, 160s
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
        else:
            resp.raise_for_status()
    raise Exception("Max retries exceeded")

4. Use sessions — requests.Session() reuses TCP connections and looks more like a real browser.
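A minimal version, reusing the same User-Agent string as the monitoring script above:

```python
import requests

# One Session shared across all calls: connection pooling plus persistent headers
session = requests.Session()
session.headers.update({"User-Agent": "TopPostTracker/1.0 (monitoring project)"})

# Every subsequent call reuses the pooled TCP connection, e.g.:
# resp = session.get("https://www.reddit.com/r/technology/hot.json", timeout=10)
```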

5. Don't scrape logged-in pages — Stick to public endpoints. Scraping behind auth violates Reddit's TOS.

When to Use Each Method

Method                 Best For                                  Cost          Scale
JSON endpoints         Side projects, research, <1K posts        Free          Low
ScraperAPI + proxies   Production pipelines, daily collection    ~$49/mo       High
Apify pre-built        One-off exports, non-developers           Pay per use   Medium

My recommendation: Start with Method 1. It's free and handles most use cases. When you hit rate limits consistently, add ScraperAPI for proxy rotation. Only go to Apify if you need a managed solution.

Key Takeaways

  • Reddit's .json endpoints are still the easiest way to get structured data
  • Always rotate User-Agents and respect rate limits
  • For scale, proxy rotation is non-negotiable
  • Save yourself time — deduplicate with post IDs, not URLs
  • Stick to public data. Don't scrape anything that requires login

The code in this article is tested and working as of March 2026. Reddit changes things periodically, so if something breaks, check the response format first — the field names occasionally shift.


Building a scraping pipeline? I write about Python automation, web scraping, and developer tools. Follow for more practical guides.
