agenthustler

Posted on • Originally published at thedatacollector.substack.com

How to Scrape Reddit in 2026 (3 Methods That Still Work)

Reddit used to be the Wild West of data scraping. Dozens of third-party apps pulled millions of posts per day without friction. Then 2023 happened.

In June 2023, Reddit announced API pricing of $0.24 per 1,000 API calls for commercial use. Almost overnight, third-party clients like Apollo and RIF (Reddit is Fun) announced shutdowns, thousands of subreddits went dark in protest, and the community raged. Chaos.

But here's what most people got wrong: Reddit data isn't actually locked down. It's just more intentional now. You can still scrape Reddit legally and effectively — you just need to know what actually works in 2026.

This guide covers three methods with working code examples, legal context you actually need to understand, and honest assessment of when each approach makes sense.


The Reddit API Landscape in 2026

Before we dive into methods, let's clarify what Reddit is actually protecting:

Reddit's Official API is still free for non-commercial, non-competitive use. The $0.24/1,000 calls pricing applies to commercial services that extract and republish Reddit data. If you're building tools for yourself, a research project, or offering services that add value (not just republishing), you're in the clear.

Rate limits are per OAuth client and generous for legitimate use: 100 queries per minute, averaged over a ten-minute window (only 10 per minute without OAuth).

The catch: Reddit aggressively monitors for scrapers and will block you if you look suspicious. Every method below includes mitigations.
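Whichever method you choose, client-side pacing is the cheapest mitigation. Below is a minimal sketch of a sleep-based throttle; the injectable clock and sleep parameters are not required in real use, they exist only so the logic can be tested without actually waiting:

```python
import time

class PoliteThrottle:
    """Enforce a minimum gap between requests so a scraper stays under a
    per-minute budget (set the budget to whatever limit applies to you)."""

    def __init__(self, calls_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / calls_per_minute
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self._last is not None and now - self._last < self.min_interval:
            self.sleep(self.min_interval - (now - self._last))
        self._last = self.clock()
```

Call `wait()` before every request. PRAW does equivalent pacing for you internally, which is one more reason to prefer it over raw HTTP.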


Method 1: Reddit's Official API with PRAW (Easiest Path)

If you're building anything beyond one-off data collection, this is your starting point.

PRAW (Python Reddit API Wrapper) is the maintained, endorsed way to access Reddit data. It handles authentication, rate limiting, and error handling for you.

Setup

pip install praw

Code Example: Scrape Recent Posts from a Subreddit

import praw
from datetime import datetime

# Authentication
reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    user_agent='DataCollector/1.0 (by YourRedditUsername)'
)

# Scrape posts from r/Python
subreddit = reddit.subreddit('Python')
posts = []

for submission in subreddit.new(limit=100):
    posts.append({
        'title': submission.title,
        'score': submission.score,
        'author': submission.author.name if submission.author else '[deleted]',
        'timestamp': datetime.fromtimestamp(submission.created_utc),
        'url': submission.url,
        'text': submission.selftext,
        'upvote_ratio': submission.upvote_ratio,
        'num_comments': submission.num_comments
    })

for post in posts:
    print(f"{post['title']} ({post['score']} upvotes)")

How to Get Credentials

  1. Go to https://www.reddit.com/prefs/apps
  2. Click "Create an application"
  3. Choose "script" (for personal use)
  4. Name it anything (e.g., "My Data Collector")
  5. Copy the client_id and client_secret
  6. Set a descriptive user_agent: include your app name, purpose, and Reddit username

Limits & Reality Check

  • Rate limit: ~100 queries/minute per OAuth client (averaged over a 10-minute window)
  • Cost: Free (forever, for non-commercial use)
  • Data available: Posts, comments, user profiles, subreddit metadata
  • Cannot scrape: Private messages, deleted content, or historical vote counts (only the current score is visible)
  • Depth: Listings cap out around 1,000 items per endpoint, so you can't page back indefinitely

When to use this: Monitoring specific subreddits, building tools for Reddit users, research with <50K posts.
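If you poll a subreddit on a schedule, subreddit.new() will keep returning posts you've already stored. A small helper for deduplicating between polls; the post dicts here are whatever you build yourself (the 'id' key mirrors PRAW's submission.id, but nothing below depends on PRAW):

```python
def new_posts_only(posts, seen_ids):
    """Return only posts whose 'id' hasn't been seen yet;
    updates seen_ids in place so the next poll skips them."""
    fresh = []
    for post in posts:
        if post['id'] not in seen_ids:
            seen_ids.add(post['id'])
            fresh.append(post)
    return fresh
```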


Method 2: Historical Data with Arctic Shift & Academic Torrents

If you need historical Reddit data (posts from 2015, 2018, etc.), the official API won't help you.

Pushshift used to be the standard here: a massive archive of Reddit submissions and comments spanning 2007 to 2023. Then Reddit revoked its API access in May 2023 for violating the new Data API terms, and what remains is restricted to approved moderators.

The successor is Arctic Shift, a community-maintained project that publishes Reddit data dumps via Academic Torrents.

How It Works

Arctic Shift publishes monthly archives of Reddit data (posts, comments, metadata) as torrent files. You download entire months, then query locally or load into a database.

Setup

  1. Download a torrent client (qBittorrent, Transmission)
  2. Visit: https://academictorrents.com/details/56aa49f5665710803c11137e53931c63ecd12126
  3. Choose the month/year you need
  4. Download the torrent (roughly 50GB–100GB per year, compressed)
  5. Extract and load into SQLite, Postgres, or analyze with pandas
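A note on format: the raw Arctic Shift files are newline-delimited JSON, typically zstd-compressed (.zst), so you will usually decompress them first (with zstd -d or the Python zstandard package) before converting to CSV. Once you have plain ndjson, you can stream-filter it without loading a whole month into RAM. A sketch, with field names matching Reddit's submission schema and an illustrative path:

```python
import json

def filter_ndjson(lines, subreddit, start_utc, end_utc):
    """Yield submissions from an ndjson line stream that match a
    subreddit and a created_utc window (Unix epoch seconds)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if obj.get('subreddit') == subreddit and start_utc <= obj.get('created_utc', 0) <= end_utc:
            yield obj

# Usage (path is illustrative):
# with open('/data/RS_2023-01.ndjson') as f:
#     for post in filter_ndjson(f, 'Python', 1672531200, 1704067199):
#         process(post)
```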

Code Example: Query Local Arctic Shift Data

Assuming you've downloaded a dump and converted it to CSV at /data/reddit_posts_2023.csv:

import pandas as pd
import sqlite3

# Load CSV (if using CSV export from torrent)
df = pd.read_csv('/data/reddit_posts_2023.csv')

# Filter by subreddit and date range
python_posts = df[
    (df['subreddit'] == 'Python') &
    (df['created_utc'] >= 1672531200) &  # Jan 1, 2023
    (df['created_utc'] <= 1704067199)    # Dec 31, 2023
]

print(f"Found {len(python_posts)} posts in r/Python during 2023")

# Or load into SQLite for larger datasets
conn = sqlite3.connect('/data/reddit.db')
df.to_sql('posts', conn, if_exists='append', index=False)

# Query with SQL
query = """
SELECT subreddit, COUNT(*) as post_count, AVG(score) as avg_score
FROM posts
WHERE created_utc >= ? AND created_utc <= ?
GROUP BY subreddit
ORDER BY post_count DESC
LIMIT 10
"""
result = pd.read_sql_query(query, conn, params=(1672531200, 1704067199))
print(result)

Limits & Reality Check

  • Data freshness: dumps trail live Reddit by roughly one to two months
  • Cost: Free (via torrent)
  • Scope: Historical only; good for research and sentiment analysis
  • Size: Manageable with modern hardware; 1 year ≈ 50–100GB compressed
  • Legal: Academic use is protected; commercial use is grayer (see Legal section below)

When to use this: Historical analysis, training ML models, trend research, comparative studies across years.
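Rather than hard-coding epoch bounds like 1672531200, you can derive them. A small helper that reproduces the constants used in the queries above:

```python
from datetime import datetime, timezone

def epoch_range(year):
    """Return (first, last) Unix timestamps covering a calendar year in UTC."""
    start = int(datetime(year, 1, 1, tzinfo=timezone.utc).timestamp())
    end = int(datetime(year + 1, 1, 1, tzinfo=timezone.utc).timestamp()) - 1
    return start, end

print(epoch_range(2023))  # (1672531200, 1704067199)
```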


Method 3: Web Scraping (old.reddit.com)

If you need fresh data and aren't bound by rate limits (willing to be slower), you can scrape the old Reddit UI directly.

Modern Reddit (reddit.com) is JavaScript-heavy and anti-scraper. But old.reddit.com is static HTML — much simpler to scrape.

Why This Works

  • old.reddit.com returns plain HTML without heavy JavaScript
  • No hard API-style rate limit, though Reddit still throttles and blocks abusive traffic
  • You control timing, so you can be polite
  • Works for posts, comments, user profiles
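Being polite also means checking robots.txt before you crawl. The stdlib urllib.robotparser handles the parsing; the rules below are illustrative only, fetch https://old.reddit.com/robots.txt for the real file:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules, NOT Reddit's actual robots.txt
rules = """\
User-agent: *
Disallow: /login
Allow: /r/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch('*', 'https://old.reddit.com/r/Python/'))  # True under these rules
print(rp.can_fetch('*', 'https://old.reddit.com/login'))      # False under these rules
```

In real use you'd point the parser at the live file with rp.set_url(...) followed by rp.read().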

Code Example: Scrape Posts with Beautiful Soup

import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin

# Browser-style headers reduce blocks; note the tension with the legal
# section's advice to identify yourself honestly in the User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

def scrape_old_reddit(subreddit, limit=100):
    """Scrape posts from old.reddit.com"""
    posts = []
    url = f'https://old.reddit.com/r/{subreddit}/'

    while len(posts) < limit:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract posts from the listing
        for post_elem in soup.find_all('div', class_='thing'):
            # Skip stickied posts or ads
            if 'stickied' in post_elem.get('class', []):
                continue

            title_elem = post_elem.find('a', class_='title')
            score_elem = post_elem.find('div', class_='score')
            author_elem = post_elem.find('a', class_='author')

            if not title_elem:
                continue

            posts.append({
                'title': title_elem.get_text(strip=True),
                'url': urljoin('https://old.reddit.com', title_elem.get('href', '')),
                'score': score_elem.get_text(strip=True) if score_elem else 'N/A',
                'author': author_elem.get_text(strip=True) if author_elem else '[deleted]',
                'subreddit': subreddit
            })

        # Find "next" button for pagination
        next_button = soup.find('a', rel='next')
        if not next_button or len(posts) >= limit:
            break

        url = urljoin('https://old.reddit.com', next_button.get('href', ''))
        time.sleep(2)  # Be polite: 2-second delay between requests

    return posts[:limit]

# Example usage
posts = scrape_old_reddit('Python', limit=50)
for post in posts:
    print(f"{post['title']} by {post['author']}")

Install Dependencies

pip install requests beautifulsoup4

Limits & Reality Check

  • Speed: 10–30 posts per minute (with 2-second delays)
  • Cost: Free
  • Reliability: Fragile to DOM changes; Reddit can break this with a UI update
  • Rate limiting: Less aggressive than the API, but if you hammer it you'll get blocked
  • Legal risk: Higher than PRAW (see Legal section)

When to use this: Small one-off scrapes, real-time monitoring of a few subreddits, prototyping before moving to official API.
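One wrinkle the scraper above glosses over: it stores score as a raw string, and old.reddit sometimes hides scores (rendering '•') or abbreviates them. A best-effort normalizer; the 'k'-suffix handling is an assumption about the display format:

```python
def parse_score(text):
    """Convert an old.reddit score string to an int,
    or None if the score is hidden or unparseable."""
    text = text.strip().lower()
    if not text or text in ('•', 'n/a'):
        return None
    if text.endswith('k'):
        try:
            return int(round(float(text[:-1]) * 1000))  # e.g. '1.2k' -> 1200
        except ValueError:
            return None
    try:
        return int(text)
    except ValueError:
        return None
```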


Comparison Table

| Method | Speed | Cost | Freshness | Legal Risk | Best For |
| --- | --- | --- | --- | --- | --- |
| PRAW (Official API) | ~100 req/min | Free | Real-time | Lowest | Production, research, monitoring |
| Arctic Shift | N/A (bulk) | Free | 1–2 months lag | Medium | Historical analysis, ML training |
| Web scraping (old.reddit.com) | 10–30 posts/min | Free | Real-time | Higher | Prototyping, small datasets |

Legal Considerations: What You Actually Need to Know

Reddit's terms of service forbid automated access outside the official API. But a TOS violation is a breach of contract, not a crime. Here's what actually matters:

Reddit's Terms of Service

Violating Reddit's TOS can get your account or IP banned, but it's not a criminal issue. Reddit typically sends cease-and-desist notices before escalating.

What's forbidden explicitly: Automated access without permission, except through the official API.

Exception: non-commercial use remains free under Reddit's Data API terms, and Reddit publishes separate guidance for academic researchers.

The Computer Fraud and Abuse Act (CFAA)

The CFAA makes unauthorized computer access illegal. But courts have held that scraping publicly available data is not "unauthorized access" under the CFAA. The key case is hiQ Labs v. LinkedIn (Ninth Circuit, 2019, reaffirmed in 2022 after a Supreme Court remand).

The ruling: scraping public data does not violate the CFAA, even if the site's TOS forbids it. LinkedIn tried to stop hiQ from scraping public profiles, and the Ninth Circuit sided with hiQ on the CFAA question. Note the limits, though: hiQ ultimately lost on breach-of-contract grounds and the case settled in 2022, which is why TOS still matter. The same CFAA logic applies to Reddit.

But there's nuance:

  • If you cause damage (DoS-level traffic, massive server load) or bypass technical blocks, CFAA exposure comes back into play.
  • State laws vary. CFAA interpretation changes with new administrations.
  • If you're republishing Reddit data as-is, you may infringe copyright (Reddit users own their posts' copyrights).

The Safe Path

  1. Use the official API when possible. It's endorsed and safest.
  2. Be polite when scraping. Add delays, respect robots.txt patterns, identify yourself in User-Agent.
  3. Don't republish. Summarize, analyze, or add value. Don't just copy Reddit posts into your own site.
  4. Document your purpose. "Research" or "personal project" is fine. "Competitive scraping" is not.
  5. Respect account ToS. Use non-commercial accounts. Don't claim commercial status if you're not.

For production scale, services like ScraperAPI handle rate limiting and IP rotation for you. They don't make scraping legal (compliance is still your responsibility), but they remove the operational headache, which is worth the cost if you're doing this commercially.


Real-World Example: Building a Reddit Sentiment Dashboard

Let's say you want to monitor r/stocks, r/investing, and r/crypto for real-time sentiment. Here's the hybrid approach:

import praw
import time
from datetime import datetime, timezone

reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    user_agent='SentimentMonitor/1.0 (by YourRedditUsername)'
)

def monitor_subreddits(subreddits, check_interval_minutes=30):
    """Monitor multiple subreddits for new posts"""
    # Compare epoch seconds directly: submission.created_utc is a UTC epoch,
    # so mixing it with naive local datetimes would skew the window.
    last_check = time.time() - 3600  # start with the past hour

    while True:
        for subreddit_name in subreddits:
            subreddit = reddit.subreddit(subreddit_name)

            # Collect posts newer than the last check
            new_posts = []
            for submission in subreddit.new(limit=50):
                if submission.created_utc > last_check:
                    new_posts.append({
                        'title': submission.title,
                        'score': submission.score,
                        'num_comments': submission.num_comments,
                        'created': datetime.fromtimestamp(submission.created_utc, tz=timezone.utc),
                        'url': submission.url
                    })

            if new_posts:
                avg_score = sum(p['score'] for p in new_posts) / len(new_posts)
                print(f"\nr/{subreddit_name}: {len(new_posts)} new posts, avg score: {avg_score:.1f}")

                # Highest-scoring new post
                top = max(new_posts, key=lambda x: x['score'])
                print(f"  Top: {top['title'][:60]}... ({top['score']} points)")

        last_check = time.time()
        print(f"\n[{datetime.now().strftime('%H:%M:%S')}] Sleeping {check_interval_minutes} minutes...")
        time.sleep(check_interval_minutes * 60)

# Run it
monitor_subreddits(['stocks', 'investing', 'crypto'])

This approach stays within API limits, requires no scraping, and runs indefinitely.
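Note that the monitor collects posts but never actually scores sentiment. Where a real model (VADER, a fine-tuned transformer) would plug in, here is a deliberately naive keyword counter to show the shape; the word lists are illustrative, not a real lexicon:

```python
POSITIVE = {'bull', 'bullish', 'moon', 'gain', 'rally', 'buy'}
NEGATIVE = {'bear', 'bearish', 'crash', 'loss', 'dump', 'sell'}

def keyword_sentiment(title):
    """Score a title in [-1.0, 1.0] by counting keyword hits;
    0.0 means neutral or no keywords found."""
    words = title.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total
```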


What Changed Since 2023?

  • Reddit cut off Pushshift (2023): API access revoked in May 2023; no more free archival API. Arctic Shift filled the gap for academics.
  • API pricing stabilized: No price increases since the 2023 announcement. If you're non-commercial, it's still free.
  • Third-party apps went extinct: But that was about user-facing clients, not data extraction. Bots and data scrapers adapted.
  • Detection improved: Reddit now has better bot detection. Using PRAW with a legitimate account is safer than raw HTTP requests.


TL;DR

  1. Use PRAW for anything ongoing or production. It's free, fast, and legal.
  2. Use Arctic Shift if you need historical data. Academic torrents, powerful, but slow.
  3. Scrape old.reddit.com only for small one-offs. It breaks easily and carries legal ambiguity.
  4. Be respectful. Add delays, use a real account, don't republish.
  5. For scale, use ScraperAPI to handle the details.

Reddit data is accessible in 2026. You just need to know which door to knock on.


Want to stay ahead of scraping trends? Subscribe to The Data Collector for working code, legal updates, and practical data extraction strategies. No fluff — just what works.


Disclosure: This post contains affiliate links. I may earn a commission if you sign up through my links, at no extra cost to you.



Compare web scraping APIs:

  • ScraperAPI — 5,000 free credits, 50+ countries, structured data parsing
  • Scrape.do — From $29/mo, strong Cloudflare bypass
  • ScrapeOps — Proxy comparison + monitoring dashboard

Need custom web scraping? Email hustler@curlship.com — fast turnaround, fair pricing.
