Devil Scrapes

Posted on Jun 2

Reddit Subreddit Scraper: export any subreddit to JSON for $1/1K

#webscraping #python #apify #data

Quick answer: A Reddit subreddit scraper reads any subreddit's listing — hot, new, top, rising, or controversial — and returns each post as a structured, typed row: title, author, score, upvote ratio, comment count, NSFW flag, flair, permalink, self-text (Markdown), and timestamps. The Reddit Subreddit Scraper Actor on Apify costs $0.001 per post (~$1.00 per 1,000), with fingerprint rotation, residential proxy routing, and exponential-backoff retries handled for you.

Reddit is one of the richest public opinion datasets on the internet. Every upvote score is a live signal of community interest, and every num_comments value shows how hard a topic is pulling people in. The top 100 posts from r/MachineLearning this week paint a more honest picture of what practitioners care about than any analyst report. But try to pull that data programmatically and Reddit's infrastructure does not cooperate — here's what actually happens, and how the Actor gets around it.

What is Reddit? 🔎

Reddit is a link-aggregation and discussion platform organized into subreddits — topic-specific communities identified by the r/<name> convention. Each subreddit exposes multiple sorted views: hot (a decay-weighted score that surfaces currently-popular posts), new (chronological), top (most upvoted within a time window), rising (new posts with accelerating engagement), and controversial (high engagement but split votes).

A post carries structured metadata: score, upvote ratio, comment count, author, URL, flair, NSFW flag, and a Markdown body for text posts. That metadata is public — Reddit has always made it accessible via .json URLs appended to any listing. But accessing it at scale is a different story.

Does Reddit have a public scraping API?

Not in the way most builders expect. Reddit provides the official Reddit Data API, but it requires OAuth registration and, as of 2023, moved bulk access to a paid tier that priced out most independent projects. The unauthenticated .json endpoint still exists — it's what this Actor uses — but Reddit fingerprints the TLS handshake on every request and rate-limits IPs aggressively. Getting reliable results from it without a proper HTTP fingerprinting setup is the challenge the Actor solves.

What the data looks like

Every post comes back as one flat, Pydantic-validated row. Here's a representative example from r/programming:

{
  "id": "t3_1ab2c3d",
  "post_id": "1ab2c3d",
  "subreddit": "programming",
  "title": "An honest critique of the new Rust runtime",
  "author": "u_rustacean",
  "url": "https://example.com/blog/rust-runtime",
  "permalink": "https://reddit.com/r/programming/comments/1ab2c3d/an_honest_critique",
  "selftext": null,
  "score": 1283,
  "upvote_ratio": 0.94,
  "num_comments": 312,
  "over_18": false,
  "spoiler": false,
  "stickied": false,
  "locked": false,
  "post_hint": "link",
  "flair": "Discussion",
  "created_utc": 1778875200,
  "posted_at": "2026-05-15T20:00:00+00:00",
  "scraped_at": "2026-05-15T20:05:31+00:00"
}

Twenty fields, same shape every time. selftext carries the full Markdown body for text posts when includeSelftext is enabled, so text posts arrive with their bodies ready for downstream NLP. Each row drops straight into pandas, BigQuery, or a vector store.
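If you want a cheap sanity check on the consumer side, something like the following could work. It is a sketch, not the Actor's own validation: the field names are copied from the sample row, check_row is a hypothetical helper, and it assumes posted_at is simply created_utc rendered as ISO 8601 in UTC.

```python
from datetime import datetime, timezone

# Field names copied from the sample row above.
EXPECTED_FIELDS = {
    "id", "post_id", "subreddit", "title", "author", "url", "permalink",
    "selftext", "score", "upvote_ratio", "num_comments", "over_18",
    "spoiler", "stickied", "locked", "post_hint", "flair",
    "created_utc", "posted_at", "scraped_at",
}

def check_row(row):
    """Return a list of problems found in one row (an empty list means it looks fine)."""
    problems = [f"missing field: {name}" for name in sorted(EXPECTED_FIELDS - set(row))]
    if "created_utc" in row and "posted_at" in row:
        # Assumption: posted_at is created_utc rendered as ISO 8601 in UTC.
        expected = datetime.fromtimestamp(row["created_utc"], tz=timezone.utc)
        if datetime.fromisoformat(row["posted_at"]) != expected:
            problems.append("posted_at does not match created_utc")
    return problems
```

Running check_row over a dataset before loading it downstream flags schema drift early without pulling in any dependency.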

The naive approach (and why it falls apart) 🔧

The first thing any developer tries:

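import requests

# Unauthenticated listing endpoint: up to 100 posts per page.
url = "https://www.reddit.com/r/programming/hot.json?limit=100"
r = requests.get(url, headers={"User-Agent": "my-scraper/1.0"}, timeout=10)
r.raise_for_status()  # surface a 403/429 instead of failing inside r.json()
print(r.json())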

It works exactly once. Then the 429s start. Three things make sustained Reddit scraping genuinely hard:

1. TLS fingerprinting. Reddit's infrastructure doesn't just read your User-Agent; it inspects the TLS ClientHello. Python's requests and httpx use the standard library's ssl module, whose handshake fingerprint matches no real browser, so the server hands back a 403 or 429 regardless of what the request itself contains. We get around this by running curl-cffi with rotating Firefox 147/144 and Safari 180/184 impersonation, so the TLS handshake, ALPN extension order, and HTTP/2 SETTINGS frame look like a real browser, because at the socket layer they functionally are one. Chrome profiles get 403'd from datacenter IPs specifically, so the BROWSER_PROFILES tuple in our scraper excludes Chrome.

2. IP-level rate limiting with session stickiness. Reddit paces requests by IP and session, so rotating a fresh IP between every page is the wrong move — the cookie and pagination cursor are tied together, and a new IP mid-scrape invalidates the after cursor and drops results. We thread Apify residential proxies with sticky sessions: each subreddit run keeps one stable exit IP and cookie jar for the full pagination sequence, then rotates on a block.

3. Exponential backoff on 429/5xx. Reddit returns Retry-After headers sometimes, but not always. We retry on 408, 429, 500, 502, 503, and 504 with backoff starting at 2 seconds, doubling each attempt, capped at 20 seconds, for up to 4 attempts per page. When we exhaust retries, we surface a clear status message instead of silently returning a short dataset.
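The retry policy in point 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the Actor's code: fetch is a hypothetical callable returning (status_code, headers, body), and honoring a numeric Retry-After (capped at 20 seconds) is a guess at reasonable behavior.

```python
import time

RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}

def backoff_delays(base=2.0, cap=20.0, attempts=4):
    """Waits (seconds) between attempts: doubles from `base`, never above `cap`."""
    return [min(base * 2 ** i, cap) for i in range(attempts - 1)]

def fetch_with_retries(fetch, attempts=4, sleep=time.sleep):
    """Call `fetch()` until it returns a non-retryable status or attempts run out.

    `fetch` is a hypothetical callable returning (status_code, headers, body).
    """
    delays = backoff_delays(attempts=attempts)
    for attempt in range(attempts):
        status, headers, body = fetch()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt == attempts - 1:
            break  # no point sleeping after the final attempt
        # Assumption: use Retry-After when it is a plain number of seconds.
        retry_after = headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else delays[attempt]
        sleep(min(wait, 20.0))
    # Fail loudly rather than returning a silently short dataset.
    raise RuntimeError(f"gave up after {attempts} attempts (last status {status})")

print(backoff_delays())  # [2.0, 4.0, 8.0]
```

With four attempts there are only three waits (2, 4, and 8 seconds), so the 20-second cap only comes into play if the attempt count is raised or a long Retry-After arrives.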

None of this is glamorous. All of it is the difference between a script that worked in a notebook at 9 AM and a pipeline that still runs at midnight.

The Actor ⚙️

The result is packaged as an Apify Actor: Reddit Subreddit Scraper.

You can run it from the Apify Console with no code, or drive it programmatically via the Apify API client for Python (the apify-client package):

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/reddit-subreddit-scraper").call(
    run_input={
        "subreddits": ["MachineLearning", "datascience", "programming"],
        "mode": "top",
        "timeFilter": "week",
        "maxResults": 100,
        "includeSelftext": True,
        "proxyConfiguration": {"useApifyProxy": True},
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["score"])

Input parameters:

Field	Default	Notes
`subreddits`	`["programming"]`	List of subreddit names — no `r/` prefix needed
`mode`	`hot`	`hot`, `new`, `top`, `rising`, or `controversial`
`timeFilter`	`day`	Time window for `top` / `controversial` modes
`maxResults`	`100`	Per subreddit. Max 1,000 (Reddit's listing cap)
`includeSelftext`	`true`	Include Markdown body for text posts
`proxyConfiguration`	Apify Proxy	Residential proxy routing

Multiple subreddits run in sequence, and posts are deduplicated within each subreddit.

Use cases

Community monitoring. Schedule a daily run on r/<your-product>, diff the current id set against yesterday's, and alert on new hot threads before the conversation moves past the front page. At the default 100 posts that is about $0.11 per day ($0.005 start + 100 × $0.001).
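The diff step might look like this. It is a minimal sketch that assumes each day's dataset has been loaded as a list of row dicts; new_post_ids is a hypothetical helper.

```python
def new_post_ids(today_rows, yesterday_rows):
    """Return ids present in today's run but not yesterday's, in today's order."""
    seen = {row["id"] for row in yesterday_rows}
    return [row["id"] for row in today_rows if row["id"] not in seen]

yesterday = [{"id": "t3_a"}, {"id": "t3_b"}]
today = [{"id": "t3_b"}, {"id": "t3_c"}, {"id": "t3_d"}]
print(new_post_ids(today, yesterday))  # ['t3_c', 't3_d']
```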

Content research and editorial calendars. Pull the top 100 posts from r/MachineLearning or r/webdev weekly, sort by score descending, and you have a ranked list of topics the community actually cared about — better signal than keyword tools.

Brand mention and sentiment baselines. Scrape niche subreddits for posts mentioning a product and store score and num_comments over time. A high comment count relative to score is a rough proxy for a heated, controversial thread.
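One way to put a number on that proxy is a simple comments-per-point ratio. The sketch below uses made-up rows, and comment_ratio is a hypothetical helper rather than a field the Actor returns.

```python
def comment_ratio(row):
    """Comments per point of score; max(score, 1) guards against dividing by zero."""
    return row["num_comments"] / max(row["score"], 1)

# Made-up rows for illustration only.
rows = [
    {"title": "calm", "score": 1200, "num_comments": 60},
    {"title": "heated", "score": 40, "num_comments": 310},
]
hottest = max(rows, key=comment_ratio)
print(hottest["title"])  # heated
```

Tracking this ratio per post over time is cheap, though where the "heated" threshold sits is something to calibrate per subreddit.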

Training data for NLP / LLM fine-tuning. The selftext field returns the full Markdown body of text posts. A top-1,000 run from a domain-specific subreddit gives you a labeled corpus — subreddit as category, score as quality signal, title as summary — with Pydantic validating every row before it hits the dataset.

Flair-based filtering for niche analysis. Post flair is a first-class output field. Filter r/personalfinance by flair = "Investing" or r/gamedev by flair = "Release" to slice a dataset to one topic without re-scraping.
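Filtering an already-scraped dataset by flair can be as small as the sketch below. filter_by_flair is a hypothetical helper, and it assumes flair may be null on posts that have none.

```python
def filter_by_flair(rows, flair):
    """Keep rows whose flair matches, ignoring case and surrounding whitespace."""
    wanted = flair.strip().lower()
    # Assumption: posts without flair carry None, so fall back to an empty string.
    return [r for r in rows if (r.get("flair") or "").strip().lower() == wanted]

rows = [
    {"title": "Index funds 101", "flair": "Investing"},
    {"title": "Budget check", "flair": "Budgeting"},
    {"title": "No flair here", "flair": None},
]
print([r["title"] for r in filter_by_flair(rows, "investing")])  # ['Index funds 101']
```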

Pricing — exact numbers 💰

Pay-per-event. You pay only for posts that land in your dataset.

Event	USD
Actor start (per run)	$0.005
Per post written	$0.001

Pull	Cost
100 posts	$0.11
1,000 posts	$1.01
10,000 posts	$10.01
50,000 posts (weekly sweep of 50 subs × 1K)	$50.01

Apify's $5 free trial credit covers your first ~4,900 posts with no credit card. No subscription, no monthly minimum.
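The arithmetic behind the table is one start fee per run plus a flat fee per post. Here is a small sketch using Decimal so the sub-cent amounts stay exact; run_cost is a hypothetical helper, and the two prices are taken from the event table above.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")  # per run, from the event table
PER_POST_USD = Decimal("0.001")     # per post written, from the event table

def run_cost(posts, runs=1):
    """Exact cost in USD for `posts` rows written across `runs` Actor starts."""
    return ACTOR_START_USD * runs + PER_POST_USD * posts

for posts in (100, 1_000, 10_000, 50_000):
    print(posts, run_cost(posts))
```

The exact figures are $0.105, $1.005, $10.005, and $50.005, which the table rounds up to the nearest cent. A daily 100-post run over 30 days is run_cost(3000, runs=30), or $3.15.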

The part worth knowing

The interesting part of this build isn't the parsing — it's which browser we impersonate. Reddit's gatekeeping happens at the TLS layer, before any application token is read, so the choice of fingerprint decides whether you get data or a 403. In testing, Chrome impersonation profiles get blocked from datacenter IPs specifically, while Firefox and Safari profiles pass reliably — which is why the BROWSER_PROFILES tuple in the scraper rotates between Firefox 147/144 and Safari 180/184 and deliberately excludes Chrome. The unauthenticated .json path returns the same structured post objects the official API serves for public reads, with after pagination cursors and a stable schema. The hard part was never the format; it was getting past the handshake.
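To make the after cursor concrete, here is a minimal sketch of walking a listing page by page. fetch_page is a hypothetical callable standing in for the fingerprinted HTTP layer; the page shape (data.children[].data and data.after) follows Reddit's listing JSON, and name is the post's fullname such as "t3_1ab2c3d".

```python
def paginate(fetch_page, max_results=100):
    """Collect up to `max_results` unique posts by following the `after` cursor.

    `fetch_page(after)` is a hypothetical callable that returns one parsed
    listing page: {"data": {"children": [...], "after": <cursor or None>}}.
    """
    posts, seen, after = [], set(), None
    while len(posts) < max_results:
        page = fetch_page(after)["data"]
        for child in page["children"]:
            post = child["data"]
            if post["name"] in seen:  # skip duplicates across pages
                continue
            seen.add(post["name"])
            posts.append(post)
            if len(posts) == max_results:
                break
        after = page["after"]
        # No cursor (or an empty page) means the listing is exhausted.
        if not after or not page["children"]:
            break
    return posts
```

Because fetch_page is injected, the loop can be exercised with canned pages, and the retry and proxy logic stays out of the pagination code.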

Limitations 🚧

1,000-post listing cap per subreddit. The Reddit JSON endpoint paginates a maximum of 1,000 items per listing regardless of maxResults. For deeper history, the Pushshift archive is a separate data source (and Actor).
No comment scraping. Comments fan out to arbitrary depth. A dedicated reddit-post-comments-scraper handles that use case.
Rate limiting under heavy load. Running many subreddits in quick succession will hit Reddit's per-IP thresholds even with proxies. Split the work into smaller runs scheduled across a time window rather than one massive run.
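Deleted and removed content. When an account is deleted, author returns [deleted]. When a post's body is removed by moderators, selftext returns [removed]; when the author deletes the post, it returns [deleted]. This is Reddit's behavior, not a scraper artifact.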
Private and quarantined subreddits. The Actor reads only public listings. Subreddits that require login or have been quarantined will return empty results or a 403.

FAQ

Is scraping public Reddit listings legal?
The .json endpoint is a public, unauthenticated surface Reddit has exposed for over 15 years. This Actor reads only public post metadata — no private messages, no authentication bypass, no votes or writes. For commercial-scale use, Reddit's Data API terms govern usage. As always, check your own jurisdiction and use case before deploying at scale.

Can I export to CSV or Google Sheets?
Yes — the Apify Console exports the dataset as JSON, CSV, or Excel. You can also webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n, or pull it via the Apify API.

Does Reddit have an official API I should use instead?
Yes. The Reddit Data API exists and is the right choice for write operations (voting, commenting, posting) or for use cases that need authenticated access to private content. For read-only bulk extraction of public posts, the unauthenticated .json path returns the same structured post objects without OAuth registration.

Why do url and permalink sometimes have the same value?
For self-posts (text posts with no external link), Reddit's url field points back to the post itself. The permalink field is always the canonical Reddit URL. The post_hint field ("self" vs "link" vs "image") is the cleanest way to distinguish post types.
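A small helper along these lines can separate text posts from link posts. It is a sketch: is_self_post is hypothetical, and the URL fallback assumes a self-post's url contains its own /comments/<post_id> path when post_hint is missing.

```python
def is_self_post(row):
    """True for text posts, using post_hint first and the URL as a fallback."""
    if row.get("post_hint"):
        return row["post_hint"] == "self"
    # Fallback assumption: a self-post's url points at its own comments page.
    return f"/comments/{row['post_id']}" in row["url"]

link_post = {"post_id": "1ab2c3d", "post_hint": "link",
             "url": "https://example.com/blog/rust-runtime"}
text_post = {"post_id": "9zz8yy7", "post_hint": None,
             "url": "https://www.reddit.com/r/programming/comments/9zz8yy7/ask/"}
print(is_self_post(link_post), is_self_post(text_post))  # False True
```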

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/reddit-subreddit-scraper.

Free $5 trial credit, no credit card. Point it at any subreddit and you'll have 100 typed rows in your dataset inside a minute. Want a field it doesn't return yet, or hit a subreddit that behaves oddly? Drop it in the comments — we read every report and ship fixes weekly.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈

DEV Community