Reddit Scraping in 2026: What Actually Works (And What Gets You Banned)

Reddit is a goldmine of user-generated data — product feedback, market research, trend signals, community sentiment. But scraping it in 2026 is a very different challenge than it was three years ago. After Reddit's controversial API pricing changes in 2023 that killed off third-party apps, the platform tightened its defenses significantly.

This guide covers what actually works in 2026: the technical realities, the right tools for each use case, and production-ready Python code you can run today.

What Changed After the API Pricing War

In June 2023, Reddit introduced tiered API pricing that effectively priced out indie developers. The free tier allows 100 requests per minute for non-commercial use. For anything beyond that, you're looking at $0.24 per 1,000 API calls.

What this meant practically:

  • The official API became expensive for large-scale work
  • Reddit doubled down on bot detection for scrapers trying to bypass costs
  • Rate limiting became more aggressive and context-aware
  • User-Agent sniffing and behavioral analysis improved

The good news: Reddit's HTML structure is still accessible, and legitimate scraping at moderate scale remains achievable.

Approach 1: The Official Reddit API (PRAW)

For many use cases, the official API is still the right answer. If you're doing keyword monitoring, subreddit analysis, or building something that needs to stay within Reddit's ToS, PRAW (Python Reddit API Wrapper) is your friend.

import praw
import time
from datetime import datetime, timezone

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="keyword-monitor/1.0 by u/your_username",
    username="your_reddit_username",
    password="your_reddit_password",
)

def monitor_keyword(subreddit_name: str, keyword: str, limit: int = 100):
    """Monitor a subreddit for posts containing a keyword."""
    subreddit = reddit.subreddit(subreddit_name)
    results = []

    for post in subreddit.new(limit=limit):
        if keyword.lower() in post.title.lower() or keyword.lower() in post.selftext.lower():
            results.append({
                "id": post.id,
                "title": post.title,
                "score": post.score,
                "url": post.url,
                "created_utc": datetime.fromtimestamp(post.created_utc, tz=timezone.utc).isoformat(),
                "num_comments": post.num_comments,
                "author": str(post.author),
                "subreddit": str(post.subreddit),
            })

    return results

def scrape_comments(post_id: str, more_limit: int = 5):
    """Scrape comments from a Reddit post."""
    submission = reddit.submission(id=post_id)
    # limit caps how many "MoreComments" placeholders get expanded
    # (one API call each); it is not a comment-depth setting
    submission.comments.replace_more(limit=more_limit)

    comments = []
    for comment in submission.comments.list():
        comments.append({
            "id": comment.id,
            "body": comment.body,
            "score": comment.score,
            "author": str(comment.author),
            "created_utc": datetime.fromtimestamp(comment.created_utc, tz=timezone.utc).isoformat(),
            "parent_id": comment.parent_id,
            "depth": comment.depth,
        })

    return comments

# Usage
posts = monitor_keyword("Python", "scraping", limit=50)
print(f"Found {len(posts)} relevant posts")

for post in posts[:3]:
    comments = scrape_comments(post["id"])
    print(f"Post: {post['title'][:60]}... ({len(comments)} comments)")
    time.sleep(0.5)  # Respect rate limits

Rate limits with PRAW: the free tier allows 100 queries per minute per OAuth client, averaged over a 10-minute window (unauthenticated access is far lower, around 10 per minute). PRAW reads the rate-limit headers Reddit returns and handles this automatically — it will sleep when you're close to the limit.

When to use PRAW:

  • Monitoring specific subreddits for keywords
  • Building alert systems for brand mentions
  • Academic research within Reddit's data access terms
  • Applications that need comment trees or user history

Approach 2: The JSON Trick (No Authentication Required)

Append .json to almost any Reddit URL and you get the page's data back as raw JSON. This is a legitimate, long-documented feature that Reddit has kept since its early days.

import httpx
import asyncio
import json
from typing import AsyncIterator

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0; +https://yoursite.com/bot)",
}

async def fetch_subreddit_posts(
    subreddit: str,
    sort: str = "new",
    limit: int = 100,
    after: str | None = None
) -> dict:
    """Fetch posts from a subreddit using the JSON API."""
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"
    params = {"limit": min(limit, 100), "raw_json": 1}
    if after:
        params["after"] = after

    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=HEADERS, params=params)
        response.raise_for_status()
        return response.json()

async def paginate_subreddit(
    subreddit: str,
    max_posts: int = 1000,
    sort: str = "new"
) -> AsyncIterator[dict]:
    """Paginate through subreddit posts, yielding each post."""
    after = None
    collected = 0

    while collected < max_posts:
        data = await fetch_subreddit_posts(subreddit, sort=sort, after=after)
        posts = data["data"]["children"]

        if not posts:
            break

        for post in posts:
            yield post["data"]
            collected += 1
            if collected >= max_posts:
                break

        after = data["data"]["after"]
        if not after:
            break

        # Be polite: 1 second between pages
        await asyncio.sleep(1)

async def main():
    posts = []
    async for post in paginate_subreddit("entrepreneur", max_posts=200):
        posts.append({
            "title": post["title"],
            "score": post["score"],
            "url": post["url"],
            "created_utc": post["created_utc"],
            "selftext": post["selftext"][:500],  # First 500 chars
        })

    print(f"Collected {len(posts)} posts")

    # Save to JSONL (better for large datasets than JSON)
    with open("reddit_posts.jsonl", "w") as f:
        for post in posts:
            f.write(json.dumps(post) + "\n")

asyncio.run(main())

The Pushshift alternative: Pushshift's public API was restricted in 2023, but successor projects such as Arctic Shift maintain historical Reddit archives going back to 2005. For research needing historical data, this is far more efficient than paginating Reddit directly.

import httpx

async def search_pushshift(query: str, subreddit: str | None = None, after: int | None = None, before: int | None = None, size: int = 100):
    """Search Reddit comments via Arctic Shift (Pushshift successor).

    Endpoint and parameter names reflect Arctic Shift's API at the time
    of writing; check its documentation before relying on them.
    """
    url = "https://arctic-shift.photon-reddit.com/api/comments/search"
    params = {
        "q": query,
        "size": size,
        "sort": "desc",
        "sort_type": "created_utc",
    }
    if subreddit:
        params["subreddit"] = subreddit
    if after:
        params["after"] = after
    if before:
        params["before"] = before

    async with httpx.AsyncClient() as client:
        response = await client.get(url, params=params)
        response.raise_for_status()
        return response.json()["data"]

Approach 3: Browser Automation for Dynamic Content

Some Reddit content (heavily moderated threads, award-heavy posts, ads) loads differently through APIs vs. a browser. For edge cases, Playwright gives you a real browser:

from playwright.async_api import async_playwright
import asyncio

async def scrape_reddit_thread(url: str) -> dict:
    """Scrape a Reddit thread using a real browser."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            viewport={"width": 1280, "height": 720},
        )
        page = await context.new_page()

        # Block tracking/ad requests to speed things up
        await page.route("**/*.{png,jpg,jpeg,gif,webp,svg}", lambda route: route.abort())
        await page.route("**/{analytics,tracking,ads}**", lambda route: route.abort())

        await page.goto(url, wait_until="networkidle")

        # Extract the JSON data Reddit embeds in the page. The t3_ script
        # selector reflects Reddit's current markup; verify it against the
        # live page, since Reddit changes this periodically.
        json_data = await page.evaluate("""
            () => {
                const scripts = document.querySelectorAll('script[id^="t3_"]');
                const data = {};
                scripts.forEach(s => {
                    try { Object.assign(data, JSON.parse(s.textContent)); } catch(e) {}
                });
                return data;
            }
        """)

        await browser.close()
        return json_data

Rate Limiting: The Right Strategy

Getting rate-limited on Reddit typically means a temporary IP block, usually 10 minutes to a few hours. Here's how to stay under the radar:

1. Respect the crawl delay

import asyncio
import random
from dataclasses import dataclass, field
from collections import deque
from time import monotonic

@dataclass
class RateLimiter:
    requests_per_minute: int = 30
    _timestamps: deque = field(default_factory=deque)

    async def acquire(self):
        now = monotonic()
        # Drop timestamps older than the 60-second window
        while self._timestamps and now - self._timestamps[0] > 60:
            self._timestamps.popleft()

        if len(self._timestamps) >= self.requests_per_minute:
            # Wait until the oldest request falls out of the window
            sleep_time = 60 - (now - self._timestamps[0])
            if sleep_time > 0:
                await asyncio.sleep(sleep_time)
            self._timestamps.popleft()  # that slot has now expired

        # Add jitter to avoid pattern detection
        await asyncio.sleep(random.uniform(0.5, 1.5))
        self._timestamps.append(monotonic())

2. Rotate User-Agents (but don't overdo it)

Reddit's bot detection looks for inconsistent User-Agent rotation — changing UA on every request is a red flag. Rotate between 3-5 realistic browser strings and stick with each for at least 10 minutes.
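The sticky rotation described above can be sketched as a small helper. The class name and the User-Agent strings in the pool are illustrative, not a canonical list:

```python
import random
import time

class StickyUserAgentRotator:
    """Rotate within a small pool of realistic User-Agent strings,
    keeping each one for a minimum "sticky" window."""

    def __init__(self, user_agents: list[str], min_sticky_seconds: float = 600):
        self.user_agents = user_agents
        self.min_sticky_seconds = min_sticky_seconds
        self._current = random.choice(user_agents)
        self._since = time.monotonic()

    def get(self) -> str:
        # Only consider switching once the sticky window has elapsed
        if time.monotonic() - self._since >= self.min_sticky_seconds:
            self._current = random.choice(self.user_agents)
            self._since = time.monotonic()
        return self._current

# A small pool of realistic desktop browser strings (examples only)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

rotator = StickyUserAgentRotator(USER_AGENTS, min_sticky_seconds=600)
ua = rotator.get()  # pass this as the User-Agent header on each request
```

Call `get()` before each request; the same string comes back until the window elapses, which looks far more like a real browser session than per-request rotation.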

3. Handle 429s gracefully with exponential backoff

import httpx
import asyncio

async def fetch_with_backoff(url: str, headers: dict, max_retries: int = 5) -> httpx.Response:
    """Fetch a URL with exponential backoff on rate limit errors."""
    async with httpx.AsyncClient() as client:  # reuse one connection pool across retries
        for attempt in range(max_retries):
            response = await client.get(url, headers=headers, timeout=30)

            if response.status_code == 200:
                return response
            elif response.status_code == 429:
                # Check Retry-After header
                retry_after = int(response.headers.get("Retry-After", 60))
                wait_time = max(retry_after, 2 ** attempt * 10)
                print(f"Rate limited. Waiting {wait_time}s (attempt {attempt + 1}/{max_retries})")
                await asyncio.sleep(wait_time)
            elif response.status_code in (403, 503):
                wait_time = 2 ** attempt * 30
                print(f"Blocked (HTTP {response.status_code}). Waiting {wait_time}s")
                await asyncio.sleep(wait_time)
            else:
                response.raise_for_status()

    raise RuntimeError(f"Failed after {max_retries} attempts")

Common Pitfalls and How to Avoid Them

Pitfall 1: Scraping old.reddit.com vs. new.reddit.com

The old Reddit interface (old.reddit.com) has simpler HTML and is less JavaScript-heavy. The JSON trick works on both, but if you're doing HTML scraping, old Reddit is more reliable.
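If you do go the HTML route, a dependency-free sketch using the standard library's HTMLParser might look like this. The `a.title` selector reflects old Reddit's long-stable listing markup, but verify it against the live page before relying on it:

```python
from html.parser import HTMLParser

class OldRedditListingParser(HTMLParser):
    """Extract post titles and permalinks from an old.reddit.com listing page.

    Assumes old Reddit's markup, where each post title is an <a> element
    carrying the class "title". Check the live page before relying on it.
    """

    def __init__(self):
        super().__init__()
        self.posts = []
        self._in_title = False
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "title" in classes:
            self._in_title = True
            self._current_href = attrs.get("href", "")

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.posts.append({"title": data.strip(), "href": self._current_href})

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False

# Feed it HTML fetched from old.reddit.com (sample snippet shown here)
sample = '<div class="thing"><a class="title" href="/r/Python/comments/abc/">Hello world</a></div>'
parser = OldRedditListingParser()
parser.feed(sample)
```

For production HTML scraping you would likely reach for a proper parser like BeautifulSoup or lxml, but the selector logic stays the same.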

Pitfall 2: The replace_more trap in PRAW

When you call submission.comments.replace_more(limit=None), PRAW makes a separate API request for each "MoreComments" object. On a viral thread with 5,000 comments, this can mean hundreds of API calls. Use limit=10 or limit=20 for most use cases, and only limit=None when you genuinely need every comment.

Pitfall 3: Storing raw Reddit data

Reddit's API ToS limits how long you can store certain data types. User account data in particular has deletion propagation requirements — if a user deletes their account, you must delete their data too (if storing beyond 90 days). Design your pipeline with this in mind.
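One pragmatic way to satisfy that requirement is a blunt retention policy: instead of tracking individual account deletions, drop anything older than the window so a deleted user's data never outlives it. This sketch assumes a `posts` table with a `scraped_at` timestamp column (like the SQLite schema in the monitor script later in this post); the function name and 90-day default are illustrative:

```python
import sqlite3

def purge_stale_user_data(conn: sqlite3.Connection, max_age_days: int = 90) -> int:
    """Delete stored posts older than the retention window.

    Run this periodically (e.g. daily via cron) so stored data never
    exceeds the retention window, regardless of account deletions.
    """
    cur = conn.execute(
        "DELETE FROM posts WHERE scraped_at < datetime('now', ?)",
        (f"-{max_age_days} days",),
    )
    conn.commit()
    return cur.rowcount  # number of rows purged
```

If you genuinely need longer retention, you'd instead have to re-check posts against the API and delete rows whose authors have become `[deleted]`.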

Pitfall 4: Ignoring flairs and crosspost data

Many developers miss that the crosspost_parent_list field can cascade indefinitely. If you're recursively following crossposts, add a depth limit.
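A minimal sketch of such a depth cap, assuming the post dict shape returned by Reddit's JSON endpoints (`crosspost_parent_list` nests parent post dicts); the function name and cap value are placeholders:

```python
def collect_crosspost_chain(post_data: dict, max_depth: int = 3) -> list[dict]:
    """Walk crosspost_parent_list recursively with a hard depth cap.

    post_data is the "data" dict of a post as returned by Reddit's JSON
    endpoints; field names follow Reddit's public JSON schema.
    """
    chain = []

    def walk(data: dict, depth: int):
        if depth > max_depth:
            return  # stop cascading, however deep the crosspost chain goes
        chain.append({"id": data.get("id"), "title": data.get("title"), "depth": depth})
        for parent in data.get("crosspost_parent_list", []):
            walk(parent, depth + 1)

    walk(post_data, 0)
    return chain
```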

Approach 4: Using a Dedicated Scraping Service

For production workloads requiring thousands of posts per day, maintaining your own scraper is expensive. You need to handle IP rotation, CAPTCHA solving, browser fingerprinting, and Reddit's evolving bot detection.

This is where dedicated Reddit scraping APIs earn their keep. Services like the Reddit Comment Scraper on Apify handle the infrastructure for you — you get clean structured data without maintaining proxies or dealing with rate limits yourself.

The Apify platform's pay-per-result model means you only pay for actual data delivered:

import httpx

def scrape_reddit_via_apify(subreddit: str, keyword: str, max_posts: int = 500) -> list:
    """Use Apify's Reddit scraper actor for large-scale collection."""
    # Example integration pattern with Apify actors. Endpoint paths and the
    # actor's input fields are illustrative; check the Apify API docs and
    # the actor's README for the exact schema.
    api_token = "YOUR_APIFY_TOKEN"
    actor_id = "cryptosignals~reddit-comment-scraper"  # API paths use ~ between user and actor name

    # Start a run
    run_response = httpx.post(
        f"https://api.apify.com/v2/acts/{actor_id}/runs",
        headers={"Authorization": f"Bearer {api_token}"},
        json={
            "subreddit": subreddit,
            "keyword": keyword,
            "maxPosts": max_posts,
        }
    )
    run_id = run_response.json()["data"]["id"]

    # Wait for completion and fetch results
    # (In production, use webhooks instead of polling)
    import time
    while True:
        status = httpx.get(
            f"https://api.apify.com/v2/acts/{actor_id}/runs/{run_id}",
            headers={"Authorization": f"Bearer {api_token}"}
        ).json()["data"]["status"]

        if status in ("SUCCEEDED", "FAILED", "ABORTED"):
            break
        time.sleep(5)

    # Fetch dataset
    results = httpx.get(
        f"https://api.apify.com/v2/acts/{actor_id}/runs/{run_id}/dataset/items",
        headers={"Authorization": f"Bearer {api_token}"}
    ).json()

    return results

When to use a service vs. DIY:

  • DIY: < 10,000 posts/day, one-time research, academic use
  • Service: Production pipelines, > 50,000 posts/day, teams without scraping expertise, data that needs to be current

Putting It Together: A Complete Keyword Monitor

Here's a production-ready script that monitors a list of subreddits with PRAW, saves matches to SQLite (deduplicating on post ID), and prints each hit — swap the print for whatever alerting you use:

import praw
import sqlite3
from dataclasses import dataclass

@dataclass
class RedditPost:
    id: str
    subreddit: str
    title: str
    selftext: str
    score: int
    url: str
    permalink: str
    author: str
    created_utc: float
    num_comments: int
    keyword_matched: str

def init_db(db_path: str = "reddit_monitor.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS posts (
            id TEXT PRIMARY KEY,
            subreddit TEXT,
            title TEXT,
            selftext TEXT,
            score INTEGER,
            url TEXT,
            permalink TEXT,
            author TEXT,
            created_utc REAL,
            num_comments INTEGER,
            keyword_matched TEXT,
            scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    return conn

def save_post(conn: sqlite3.Connection, post: RedditPost):
    conn.execute(
        "INSERT OR IGNORE INTO posts VALUES (?,?,?,?,?,?,?,?,?,?,?,CURRENT_TIMESTAMP)",
        (post.id, post.subreddit, post.title, post.selftext, post.score,
         post.url, post.permalink, post.author, post.created_utc,
         post.num_comments, post.keyword_matched)
    )
    conn.commit()

def monitor(subreddits: list[str], keywords: list[str], reddit: praw.Reddit):
    conn = init_db()

    for subreddit_name in subreddits:
        subreddit = reddit.subreddit(subreddit_name)
        for post in subreddit.new(limit=100):
            text = f"{post.title} {post.selftext}".lower()
            for keyword in keywords:
                if keyword.lower() in text:
                    p = RedditPost(
                        id=post.id,
                        subreddit=subreddit_name,
                        title=post.title,
                        selftext=post.selftext[:1000],
                        score=post.score,
                        url=post.url,
                        permalink=f"https://reddit.com{post.permalink}",
                        author=str(post.author),
                        created_utc=post.created_utc,
                        num_comments=post.num_comments,
                        keyword_matched=keyword,
                    )
                    save_post(conn, p)
                    print(f"[{keyword}] {post.title[:80]}")
                    break  # Don't double-count if multiple keywords match

    conn.close()

# Run every hour via cron:
# 0 * * * * python3 /path/to/monitor.py
if __name__ == "__main__":
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="keyword-monitor/1.0",
    )
    monitor(
        subreddits=["entrepreneur", "startups", "SaaS", "Python"],
        keywords=["scraping", "data collection", "API alternative"],
        reddit=reddit,
    )

Summary: Which Approach for Which Situation

| Use case | Recommended approach | Why |
| --- | --- | --- |
| Keyword monitoring, ToS-compliant | PRAW + official API | Reliable, structured, handles rate limits |
| One-time research, moderate scale | JSON trick + asyncio | No auth needed, fast |
| Historical data (pre-2023) | Arctic Shift / Pushshift | Much faster than paginating live |
| Production pipeline, high volume | Apify or similar service | Infrastructure handled, no IP bans |
| Edge cases, dynamic content | Playwright | Full browser rendering |

Reddit scraping in 2026 is viable when done thoughtfully. Respect rate limits, handle errors gracefully, and choose the right tool for your scale. Scrapers that try to brute-force it get blocked within hours; those that treat Reddit's infrastructure with respect can run indefinitely.
