Reddit is a goldmine of unstructured human conversation — 100,000+ active communities discussing everything from machine learning to mechanical keyboards. For researchers, analysts, and NLP practitioners, Reddit data powers sentiment analysis, trend detection, market research, and training datasets.
In this guide, I'll show you three practical approaches to scraping Reddit in 2026: the built-in JSON API, the PRAW library, and raw HTTP scraping. All with working Python code.
Why Scrape Reddit?
- Sentiment analysis: Track public opinion on brands, products, or events
- Market research: Find what people actually say about competitors
- NLP training data: Millions of labeled conversations (upvotes = quality signal)
- Trend detection: Spot emerging topics before they hit mainstream
- Academic research: Social network analysis, community dynamics
- Content aggregation: Build curated feeds from niche subreddits
Method 1: Reddit's JSON API (No Auth Required)
Reddit has a little-known feature: append .json to almost any Reddit URL and you get structured JSON data back. No API key needed.
import requests
import time
def fetch_subreddit_posts(subreddit, sort="hot", limit=25, timeout=10):
    """Fetch posts from a subreddit via Reddit's public JSON endpoint.

    Args:
        subreddit: Subreddit name without the "r/" prefix.
        sort: Listing to read: "hot", "new", "top", "rising", or "controversial".
        limit: Number of posts to request (Reddit caps a single page at 100).
        timeout: Seconds to wait for the HTTP response before giving up.

    Returns:
        A list of dicts with the key post fields, or [] on any non-200 response.
    """
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"
    params = {
        "limit": limit,
        "raw_json": 1,  # Prevents HTML encoding in responses
    }
    # A descriptive User-Agent matters: Reddit heavily throttles the default
    # python-requests User-Agent.
    headers = {
        "User-Agent": "PythonScraper/1.0 (research project)"
    }
    # timeout keeps a stalled connection from hanging the scraper forever.
    response = requests.get(url, params=params, headers=headers, timeout=timeout)
    if response.status_code != 200:
        print(f"Error: HTTP {response.status_code}")
        return []
    data = response.json()
    posts = []
    for child in data["data"]["children"]:
        post = child["data"]
        posts.append({
            "id": post["id"],
            "title": post["title"],
            "author": post["author"],
            "score": post["score"],
            "num_comments": post["num_comments"],
            "created_utc": post["created_utc"],
            "url": post["url"],
            "selftext": post["selftext"][:500],  # First 500 chars
            "permalink": f"https://reddit.com{post['permalink']}",
        })
    return posts
# Usage: pull the 100 top posts, then preview the five highest-ranked.
posts = fetch_subreddit_posts("machinelearning", sort="top", limit=100)
for p in posts[:5]:
    print(f"[{p['score']}] {p['title']}")
Available Sort Options
- `/hot.json` — Currently trending
- `/new.json` — Most recent
- `/top.json` — Highest scored (add `?t=day|week|month|year|all`)
- `/rising.json` — Gaining momentum
- `/controversial.json` — Most debated
Pagination with the after Parameter
Reddit returns 25-100 posts per request. To get more, use the after parameter with the last post's fullname (t3_ + id):
def scrape_all_posts(subreddit, sort="new", max_posts=500, timeout=10):
    """Page through a subreddit listing until *max_posts* posts are collected.

    Follows Reddit's cursor-style pagination via the "after" fullname. A 429
    response triggers a 60-second back-off and retry; any other non-200
    response ends pagination early with whatever was collected so far.

    Args:
        subreddit: Subreddit name without the "r/" prefix.
        sort: Listing to page through ("new", "top", "hot", ...).
        max_posts: Stop once this many posts have been gathered.
        timeout: Per-request HTTP timeout in seconds.

    Returns:
        A list of post dicts, at most *max_posts* long.
    """
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"
    # Hoisted out of the loop: the headers never change between requests.
    headers = {"User-Agent": "PythonScraper/1.0 (research)"}
    all_posts = []
    after = None  # fullname cursor of the last post seen (e.g. "t3_abc123")
    while len(all_posts) < max_posts:
        params = {
            "limit": 100,
            "raw_json": 1,
        }
        if after:
            params["after"] = after
        response = requests.get(url, params=params, headers=headers, timeout=timeout)
        if response.status_code == 429:
            print("Rate limited — waiting 60s...")
            time.sleep(60)
            continue
        if response.status_code != 200:
            break
        data = response.json()
        children = data["data"]["children"]
        if not children:
            break
        for child in children:
            post = child["data"]
            all_posts.append({
                "id": post["id"],
                "title": post["title"],
                "author": post["author"],
                "score": post["score"],
                "num_comments": post["num_comments"],
                "selftext": post["selftext"],
                "created_utc": post["created_utc"],
                "subreddit": post["subreddit"],
            })
        after = data["data"]["after"]  # None when the listing is exhausted
        if not after:
            break
        print(f"Fetched {len(all_posts)} posts so far...")
        time.sleep(2)  # Reddit rate limit: ~30 req/min without auth
    return all_posts[:max_posts]
# Scrape 500 recent posts from r/datascience
# NOTE: with the 2s inter-page delay this takes roughly 10-15 seconds.
posts = scrape_all_posts("datascience", sort="new", max_posts=500)
print(f"Total: {len(posts)} posts")
Scraping Comments
Comments are where the real value is. Here's how to get them:
def fetch_post_comments(post_id, subreddit, timeout=10):
    """Fetch a post's comment tree and flatten it into a list of dicts.

    Args:
        post_id: The post's base-36 id (without the "t3_" prefix).
        subreddit: Subreddit name without the "r/" prefix.
        timeout: Seconds to wait for the HTTP response before giving up.

    Returns:
        A flat list of comment dicts (see parse_comments), or [] on any
        non-200 response.
    """
    url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json"
    headers = {"User-Agent": "PythonScraper/1.0 (research)"}
    params = {"raw_json": 1, "limit": 500}
    # timeout keeps a stalled connection from hanging the scraper.
    response = requests.get(url, params=params, headers=headers, timeout=timeout)
    if response.status_code != 200:
        return []
    data = response.json()
    # Reddit returns a two-element listing: [post_data, comments_data]
    comments_data = data[1]["data"]["children"]
    comments = []
    parse_comments(comments_data, comments, depth=0)
    return comments
def parse_comments(children, results, depth):
    """Recursively flatten a Reddit comment listing into *results*.

    Each appended dict records id, author, body, score, created_utc, and the
    nesting *depth* (0 for top-level comments). Non-comment nodes such as
    "more" stubs are skipped.
    """
    for node in children:
        # Listings mix comment nodes ("t1") with pagination stubs ("more").
        if node["kind"] != "t1":
            continue
        data = node["data"]
        results.append({
            "id": data["id"],
            "author": data["author"],
            "body": data["body"],
            "score": data["score"],
            "created_utc": data["created_utc"],
            "depth": depth,
        })
        # A comment with no replies carries "" here; a populated reply tree
        # is a nested listing dict — recurse one level deeper.
        replies = data.get("replies")
        if replies and isinstance(replies, dict):
            parse_comments(replies["data"]["children"], results, depth + 1)
# Walk the first ten comments of a post, indenting replies by depth.
comments = fetch_post_comments("abc123", "python")
for reply in comments[:10]:
    prefix = " " * reply["depth"]
    print(f"{prefix}[{reply['score']}] {reply['author']}: {reply['body'][:80]}")
Method 2: PRAW (Python Reddit API Wrapper)
For authenticated access with higher rate limits, use PRAW. You'll need to create a Reddit app at https://www.reddit.com/prefs/apps/:
import praw

# Credentials come from the app you register at reddit.com/prefs/apps.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="PythonResearch/1.0"
)

# Fetch top posts
subreddit = reddit.subreddit("artificial")
for submission in subreddit.top(time_filter="week", limit=50):
    print(f"[{submission.score}] {submission.title}")
    print(f" Comments: {submission.num_comments}")
    print(f" Author: {submission.author}")
    print()

# Search across Reddit
for submission in reddit.subreddit("all").search("web scraping python", limit=25):
    print(f"r/{submission.subreddit}: {submission.title}")
PRAW vs JSON API
| Feature | JSON API | PRAW |
|---|---|---|
| Auth required | No | Yes (free) |
| Rate limit | ~30 req/min | 60 req/min |
| Comment depth | Limited | Full tree |
| Search | Basic | Advanced |
| Streaming | No | Yes |
| Setup | Zero | Create Reddit app |
Searching Across Reddit
The JSON API supports search too:
def search_reddit(query, subreddit=None, sort="relevance", time_filter="all", limit=100, timeout=10):
    """Search Reddit (site-wide, or within one subreddit) via the JSON API.

    Args:
        query: Search terms.
        subreddit: If given, restrict the search to this subreddit.
        sort: "relevance", "hot", "top", "new", or "comments".
        time_filter: "hour", "day", "week", "month", "year", or "all".
        limit: Maximum results to request (capped at 100 by the API).
        timeout: Seconds to wait for the HTTP response before giving up.

    Returns:
        A list of result dicts, or [] on any non-200 response.
    """
    if subreddit:
        url = f"https://www.reddit.com/r/{subreddit}/search.json"
    else:
        url = "https://www.reddit.com/search.json"
    params = {
        "q": query,
        "sort": sort,
        "t": time_filter,
        "limit": min(limit, 100),  # the API returns at most 100 per page
        "raw_json": 1,
        # restrict_sr keeps results inside the subreddit instead of site-wide.
        "restrict_sr": 1 if subreddit else 0,
    }
    headers = {"User-Agent": "PythonScraper/1.0 (research)"}
    response = requests.get(url, params=params, headers=headers, timeout=timeout)
    if response.status_code != 200:
        return []
    results = []
    for child in response.json()["data"]["children"]:
        post = child["data"]
        results.append({
            "title": post["title"],
            "subreddit": post["subreddit"],
            "score": post["score"],
            "url": post["url"],
            "num_comments": post["num_comments"],
        })
    return results
# Search for scraping discussions
# NOTE: a single call returns at most 100 results (the JSON API's page cap).
results = search_reddit("web scraping best practices 2026", sort="top", time_filter="month")
Rate Limits and How to Handle Them
Reddit's rate limits:
- Unauthenticated: ~30 requests per minute
- Authenticated (OAuth): 60 requests per minute
- With premium: 100 requests per minute
Handling Rate Limits Gracefully
import time
from functools import wraps
def rate_limited(max_per_minute=30):
    """Decorator factory that spaces calls at least 60/max_per_minute s apart.

    The wrapped function sleeps just long enough to honour the budget, then
    runs; the timestamp is taken after the call completes, so the interval is
    measured between call *ends* (slightly conservative).
    """
    min_interval = 60.0 / max_per_minute

    def decorator(func):
        # 0.0 sentinel means "never called yet" — the first call never sleeps.
        last_finished = 0.0

        @wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal last_finished
            pause = min_interval - (time.time() - last_finished)
            if pause > 0:
                time.sleep(pause)
            value = func(*args, **kwargs)
            last_finished = time.time()
            return value

        return wrapper

    return decorator
@rate_limited(max_per_minute=25)
def safe_fetch(url, params, headers):
    """Rate-limited GET wrapper: stays safely under Reddit's ~30 req/min cap.

    A timeout is set so a dead connection cannot stall the whole scrape
    (consistent with the other fetch helpers in this guide).
    """
    return requests.get(url, params=params, headers=headers, timeout=10)
Scaling Up: Using Proxies
For large-scale scraping (thousands of posts across many subreddits), you'll hit rate limits fast. Two solutions:
Proxy Aggregation with ScrapeOps
ScrapeOps routes your requests through the cheapest working proxy automatically:
def fetch_via_scrapeops(target_url, timeout=60):
    """Fetch *target_url* through the ScrapeOps proxy-aggregation endpoint.

    Proxy endpoints may retry several upstream proxies before answering,
    hence the generous default timeout. Returns the proxied response body
    parsed as JSON.
    """
    params = {
        "api_key": "YOUR_SCRAPEOPS_KEY",
        "url": target_url,
    }
    response = requests.get("https://proxy.scrapeops.io/v1/", params=params, timeout=timeout)
    return response.json()
Managed Scraping with ScraperAPI
ScraperAPI handles proxy rotation, retries, and CAPTCHA solving:
def fetch_via_scraperapi(target_url, timeout=60):
    """Fetch *target_url* through ScraperAPI's managed scraping endpoint.

    ScraperAPI may rotate proxies and solve CAPTCHAs before responding, so
    the default timeout is deliberately long. Returns the response parsed
    as JSON.
    """
    params = {
        "api_key": "YOUR_SCRAPERAPI_KEY",
        "url": target_url,
    }
    response = requests.get("https://api.scraperapi.com", params=params, timeout=timeout)
    return response.json()
Building a Complete Reddit Dataset
Here's a full pipeline that scrapes posts and comments, then saves to JSON:
import requests
import json
import time
from datetime import datetime
def build_reddit_dataset(subreddits, posts_per_sub=100, include_comments=True):
    """Scrape each subreddit's top posts, optionally attaching comment threads.

    Returns a flat list of post dicts; every dict carries a "comments" list
    (empty when include_comments is False or the post has no comments).
    """
    dataset = []
    for sub in subreddits:
        print(f"\n--- Scraping r/{sub} ---")
        posts = scrape_all_posts(sub, sort="top", max_posts=posts_per_sub)
        total = len(posts)
        for position, post in enumerate(posts, start=1):
            entry = {**post, "subreddit": sub}
            if include_comments and post["num_comments"] > 0:
                time.sleep(2)  # be polite between comment-thread requests
                comments = fetch_post_comments(post["id"], sub)
                entry["comments"] = comments
                print(f" [{position}/{total}] {post['title'][:50]}... ({len(comments)} comments)")
            else:
                entry["comments"] = []
            dataset.append(entry)
        print(f"Completed r/{sub}: {total} posts")
    return dataset
# Build a dataset from multiple subreddits
from datetime import timezone  # for an aware UTC timestamp below

subreddits = ["datascience", "machinelearning", "python", "webdev"]
dataset = build_reddit_dataset(
    subreddits,
    posts_per_sub=50,
    include_comments=True
)

# Save with metadata
output = {
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC
    # timestamp instead. The ISO string now carries an explicit "+00:00".
    "scraped_at": datetime.now(timezone.utc).isoformat(),
    "subreddits": subreddits,
    "total_posts": len(dataset),
    "total_comments": sum(len(p.get("comments", [])) for p in dataset),
    "posts": dataset,
}
with open("reddit_dataset.json", "w", encoding="utf-8") as f:
    json.dump(output, f, indent=2, ensure_ascii=False)
print(f"\nDataset saved: {output['total_posts']} posts, {output['total_comments']} comments")
Exporting to CSV for Analysis
import csv
def export_posts_csv(posts, filename="reddit_posts.csv"):
    """Write post dicts to a CSV file, keeping only the analysis columns.

    extrasaction="ignore" makes DictWriter drop any extra keys (such as
    "comments") instead of raising ValueError.

    Args:
        posts: Iterable of post dicts from the scraping helpers above.
        filename: Output path for the CSV file.
    """
    fieldnames = ["id", "subreddit", "title", "author", "score",
                  "num_comments", "created_utc", "selftext"]
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(posts)
    # Bug fix: the message previously printed a literal "(unknown)" instead of
    # interpolating the output path.
    print(f"Exported {len(posts)} posts to {filename}")
def export_comments_csv(posts, filename="reddit_comments.csv"):
    """Write every comment from *posts* to a flat CSV, one row per comment.

    Each row links the comment back to its post via post_id. Posts without a
    "comments" key contribute no rows.
    """
    fieldnames = ["post_id", "comment_id", "author", "body", "score", "depth"]
    with open(filename, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        # Flatten the post->comments nesting lazily and emit in one pass.
        rows = (
            {
                "post_id": post["id"],
                "comment_id": reply["id"],
                "author": reply["author"],
                "body": reply["body"],
                "score": reply["score"],
                "depth": reply["depth"],
            }
            for post in posts
            for reply in post.get("comments", [])
        )
        writer.writerows(rows)
Legal and Ethical Considerations
- Reddit's API Terms: Reddit's API terms of service require attribution and prohibit commercial use without agreement. The JSON API is public but still governed by these terms
- Rate limiting: Always respect rate limits — getting banned helps nobody
- User privacy: Don't deanonymize users or link Reddit accounts to real identities
- Content policies: Don't scrape private/quarantined subreddits
- Data retention: Consider how long you store scraped data and who has access
Wrapping Up
Reddit scraping in 2026 comes down to three approaches:
- JSON API (`.json` suffix) — zero setup, great for quick scripts and small datasets
- PRAW — higher rate limits, streaming support, better for production pipelines
- Proxy-based scaling — when you need thousands of posts, use ScrapeOps for proxy aggregation or ScraperAPI for fully managed scraping
The JSON API is where most people should start. It requires no authentication, returns clean structured data, and handles 90% of use cases. Add PRAW when you need streaming or higher limits, and proxies when you're operating at scale.
Happy scraping!
Top comments (0)