Devil Scrapes

Posted on Jun 2

Reddit User Scraper: export any user's posts & comments to JSON

#webscraping #python #apify #data

Quick answer: A Reddit user scraper pulls the public submissions and comments of one or more Reddit usernames (up to Reddit's 1,000-item listing cap per sort) and returns them as structured, typed JSON rows, including title, body, subreddit, score, permalink, and ISO-8601 timestamp. Reddit exposes this data through its public .json endpoints, but a hand-rolled script falls apart once it has to combine proxy rotation, browser fingerprint impersonation, and pagination under rate limits. The Devil Scrapes Reddit User History Scraper handles all of it for $0.001 per result (~$1.00 per 1,000).

Reddit is one of the most candid public archives on the internet. Users post and comment under persistent pseudonyms — sometimes for years — and that history is fully public. For OSINT investigators, fraud analysts, recruiters, and researchers, a user's post history is a structured timeline of opinions, domain expertise, and behavioural patterns. The catch is getting it out cleanly.

I needed a repeatable pipeline for pulling user histories at scale — not one-offs in a browser tab, but a batch job I could hand a list of 50 usernames and get back a clean dataset within minutes. Here is what that took, and how I wrapped it into a single Apify Actor.

What is Reddit user history? 🔎

Reddit is a link-aggregation and discussion platform with roughly 1.5 billion monthly unique visitors that ties posts and comments to persistent usernames, with post history kept public and searchable. Every submission and comment a public user makes is accessible via Reddit's .json endpoints at old.reddit.com/user/<username>/submitted.json and .../comments.json.

Per item you get the subreddit, the title (submissions) or body (comments) in full markdown, the score, permalink, and outbound URL, the created timestamp as both a Unix integer and an ISO-8601 conversion, a stable Reddit fullname ID (t3_ for submissions, t1_ for comments), and the NSFW flag plus comment count (submissions only).
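The Unix-to-ISO-8601 conversion is a one-liner in Python. A minimal sketch, where the function name to_iso8601 is illustrative and not taken from the Actor's code:

```python
from datetime import datetime, timezone


def to_iso8601(created_utc: float) -> str:
    """Convert a Unix timestamp in seconds to an ISO-8601 string in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()


print(to_iso8601(1685980800))  # 2023-06-05T16:00:00+00:00
```

Passing tz=timezone.utc matters: without it, fromtimestamp uses the machine's local timezone and the same row would serialise differently on different hosts.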

What you do not get: private messages, modmail, removed posts, or content from private subreddits — this Actor surfaces only what an anonymous visitor sees.

Does Reddit have a user history API? ⚙️

Technically yes — but it rate-limits aggressively and the shape has changed. Reddit's .json endpoints predate any official API and have been the de-facto data surface for scrapers and researchers for over a decade. Since 2023 Reddit has tightened authentication and rate-limiting for third-party consumers, and unauthenticated requests from datacenter IPs receive 429s within a few pages. The official Reddit API requires OAuth and enforces strict rate-limit headers. We walk the .json listing endpoints with full pagination, sticky sessions, and residential proxies — no OAuth tokens exposed in your environment.
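To make "full pagination" concrete, here is a rough sketch of walking a listing by its after cursor. It is not the Actor's code: walk_listing and the injected fetch_page callable are hypothetical names, and the proxy, session, and retry handling are deliberately left out so the cursor logic stands alone. The only thing assumed about Reddit is the listing shape (data.children[] plus data.after).

```python
from typing import Callable, Iterator, Optional

# A page fetcher takes the current `after` cursor (None for the first page)
# and returns the decoded JSON listing. It is injected so the walk itself
# can be shown without any network or proxy code.
FetchPage = Callable[[Optional[str]], dict]


def walk_listing(fetch_page: FetchPage, max_items: int = 1000) -> Iterator[dict]:
    """Yield each child's `data` dict, following the `after` cursor."""
    after: Optional[str] = None
    yielded = 0
    while yielded < max_items:
        listing = fetch_page(after)["data"]
        children = listing.get("children", [])
        if not children:
            return
        for child in children:
            if yielded >= max_items:
                return
            yield child["data"]
            yielded += 1
        after = listing.get("after")
        if after is None:  # a null cursor marks the last page
            return


# Two fake pages standing in for real responses.
PAGES = {
    None: {"data": {"children": [{"kind": "t3", "data": {"name": "t3_a"}}], "after": "t3_a"}},
    "t3_a": {"data": {"children": [{"kind": "t3", "data": {"name": "t3_b"}}], "after": None}},
}
names = [item["name"] for item in walk_listing(PAGES.__getitem__)]
print(names)  # ['t3_a', 't3_b']
```

In a real run, fetch_page would wrap the HTTP request for submitted.json or comments.json and pass the cursor as the after query parameter.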

What the data looks like

Each item — submission or comment — comes back as one flat, Pydantic-validated row:

{
  "kind": "submission",
  "id": "t3_abc123",
  "username": "spez",
  "subreddit": "announcements",
  "title": "Upcoming changes to Reddit's API",
  "body": null,
  "url": "https://www.reddit.com/r/announcements/comments/abc123/",
  "permalink": "https://www.reddit.com/r/announcements/comments/abc123/upcoming_changes_to_reddits_api/",
  "score": 14823,
  "num_comments": 9871,
  "over_18": false,
  "created_utc": 1685980800,
  "posted_at": "2023-06-05T16:00:00+00:00",
  "scraped_at": "2026-05-31T10:22:13+00:00"
}

Fourteen fields, the same shape for every row regardless of kind, validated with Pydantic v2 before each item is written. It drops straight into Pandas, BigQuery, or a JSONL pipeline.
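If you want to guard that shape on the consuming side, a small check is enough. The sketch below uses a stdlib dataclass as a stand-in for the Pydantic model, which this post does not show. Which fields are nullable is an assumption read off the sample row above, and Row and parse_row are illustrative names.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Row:
    kind: str
    id: str
    username: str
    subreddit: str
    title: Optional[str]         # assumed null for comments
    body: Optional[str]          # null in the sample submission above
    url: Optional[str]           # assumed null for comments
    permalink: str
    score: int
    num_comments: Optional[int]  # submissions only
    over_18: Optional[bool]      # submissions only
    created_utc: int
    posted_at: str
    scraped_at: str


FIELD_NAMES = tuple(f.name for f in fields(Row))


def parse_row(raw: dict) -> Row:
    """Reject rows whose keys differ from the expected 14-field shape."""
    if set(raw) != set(FIELD_NAMES):
        raise ValueError(f"unexpected keys: {sorted(set(raw) ^ set(FIELD_NAMES))}")
    return Row(**raw)
```

This checks key names only, not value types; a real Pydantic model would also coerce and validate each field.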

The naive approach (and why it falls apart) 🔥

The first thing any scraper-aware person tries: open DevTools, find the XHR to old.reddit.com/user/spez/submitted.json, replay it with requests.get(), and paginate using the after cursor. It breaks before you hit page 3. Here is why, and what we do about each:

1. TLS fingerprinting and datacenter IP bans. Reddit's CDN fingerprints the TLS ClientHello on requests from datacenter ranges. Standard Python clients — requests, httpx — emit a JA3/JA4 fingerprint that looks nothing like a browser. We rotate Firefox and Safari impersonation profiles via curl-cffi, which replays the exact TLS extension order and HTTP/2 SETTINGS frame a real browser sends, so a sustained run never presents a uniform fingerprint.

2. Aggressive rate-limiting. Reddit enforces per-IP and per-user-agent rate limits with X-Ratelimit-Remaining and Retry-After headers. We honour Retry-After on every response, back off with exponential delay starting at 2 seconds (doubling to a 30-second cap), retry up to 5 times per page, and surface partial-success counts rather than silently handing you an empty dataset.

3. Proxy session stickiness. The .json pagination cursor (after=t3_...) is tied to the session that requested the first page. If you rotate IPs between pages, Reddit invalidates the cursor and you loop on page 1 forever. We thread Apify Proxy residential sessions with a stable session_id per user, so pagination stays coherent from first page to last.

4. The 1,000-item listing cap. Reddit's listing endpoints hard-stop at 1,000 items per sort, so maxResultsPerUser=0 (meaning "up to Reddit's cap") retrieves at most 1,000 submissions and 1,000 comments per user. For longer histories, combine sortBy=top and sortBy=new runs to get complementary slices. We document this limit explicitly, and the per-result charge applies only to items that actually land in your dataset.
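The retry schedule from point 2 can be sketched in a few lines. The numbers (2-second start, doubling, 30-second cap, 5 retries) come from the text above. Letting Retry-After take precedence over the computed delay is an assumption about how "honouring" it works, and retry_delay is an illustrative name.

```python
from typing import Optional

BASE_DELAY = 2.0   # seconds before the first retry
MAX_DELAY = 30.0   # cap on any computed delay
MAX_RETRIES = 5    # retries per page


def retry_delay(attempt: int, retry_after: Optional[float] = None) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    Assumption: a Retry-After value from the server, when present,
    overrides the computed exponential delay.
    """
    if retry_after is not None:
        return retry_after
    return min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY)


schedule = [retry_delay(n) for n in range(1, MAX_RETRIES + 1)]
print(schedule)  # [2.0, 4.0, 8.0, 16.0, 30.0]
```

Worst case, a page that fails all five retries without a Retry-After header costs 60 seconds of waiting before it is counted as a failure.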

None of this is glamorous. All of it is the gap between "works on my laptop for one account" and "survives a 50-account batch run".

The Actor

The result is packaged as an Apify Actor: Reddit User History Scraper.

Drop usernames into the Apify Console and click Start, or drive it from Python:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/reddit-user-scraper").call(
    run_input={
        "usernames": ["spez", "kn0thing"],
        "what": "both",           # "submitted", "comments", or "both"
        "sortBy": "new",          # "new", "top", "hot", "controversial"
        "timeFilter": "year",     # applies only when sortBy="top" or "controversial"
        "maxResultsPerUser": 500,
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"]
        }
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["kind"], item["username"], item["subreddit"], item["score"])

Input parameters map directly to Reddit's listing API:

Parameter	Options	Default
`what`	`submitted`, `comments`, `both`	`submitted`
`sortBy`	`new`, `top`, `hot`, `controversial`	`new`
`timeFilter`	`hour`, `day`, `week`, `month`, `year`, `all`	`all`
`maxResultsPerUser`	0–1000 (0 = Reddit's cap)	`100`

Use cases 💡

OSINT and due-diligence investigations. Before a contract engagement or background check, pull the subject's public Reddit footprint: subreddit participation, post cadence, stated opinions. A recruiter screening 50 candidates before outreach turns a half-day of manual browsing into a 90-second Apify run at roughly $0.10 per account (100 items each at $0.001 per item). Tools like RedditMetis do this visually in a browser; this Actor does it headlessly, in batch, with export.

Fraud and sockpuppet detection. Pull the posting history of flagged accounts. High volume across unrelated subreddits, identical comment structures, or cadence anomalies (all activity 09:00–17:00 UTC on weekdays, never a weekend) surface faster in a spreadsheet of raw rows than by manual reading. The created_utc field is a Unix timestamp in whole seconds, which is more than enough precision for cadence analysis.
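One way to quantify the weekday office-hours pattern is the share of an account's items that fall in that window. A minimal sketch over the created_utc values; the function name and whatever threshold you treat as suspicious are assumptions, not part of the Actor:

```python
from datetime import datetime, timezone


def office_hours_share(timestamps: list[int]) -> float:
    """Fraction of items posted Mon-Fri between 09:00 and 17:00 UTC."""
    if not timestamps:
        return 0.0
    hits = 0
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        # weekday(): Monday is 0, so values below 5 are Mon-Fri
        if dt.weekday() < 5 and 9 <= dt.hour < 17:
            hits += 1
    return hits / len(timestamps)
```

A share close to 1.0 over hundreds of items is a signal worth a closer look, not proof of anything; plenty of real people only post from their desk.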

Influencer and community research. Identify which subreddits a developer or domain expert is most active in before an outreach or partnership conversation. The subreddit + score combination tells you where they earned reputation, not just where they posted once.
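The subreddit + score view is a short aggregation over the exported rows. A sketch, assuming rows shaped like the sample above; reputation_by_subreddit is an illustrative name:

```python
from collections import defaultdict


def reputation_by_subreddit(rows: list[dict]) -> list[tuple[str, int, int]]:
    """Return (subreddit, item_count, total_score), highest total score first."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for row in rows:
        entry = totals[row["subreddit"]]
        entry[0] += 1             # items posted in this subreddit
        entry[1] += row["score"]  # cumulative score earned there
    ranked = [(name, count, score) for name, (count, score) in totals.items()]
    return sorted(ranked, key=lambda r: r[2], reverse=True)
```

Comparing item_count against total_score separates a subreddit where someone posted once and went viral from one where they contribute steadily.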

Academic and journalism research. Longitudinal studies of opinion drift or community formation require stable, reproducible user-history snapshots. The Association of Internet Researchers (AoIR) maintains an ethics guide for social media data collection; this Actor surfaces only what an anonymous web visitor sees.

Content portfolio analysis. Extract a user's own high-scoring submissions for portfolio review or domain-expertise mapping. Sort by top with timeFilter=all to get the career highlights in one pass.

Pricing — exact numbers 💰

Pay-Per-Event. You pay only for items that land in your dataset.

Event	USD	When it fires
`actor-start`	$0.005	Once per run, regardless of results
`result`	$0.001	Per item written to the dataset

Example costs:

Volume	Cost
100 items (1 user, quick scan)	$0.11
1,000 items	$1.01
5,000 items (50 users × 100 items)	$5.01
10,000 items	$10.01

Apify's $5 free trial credit covers roughly your first 4,995 items in a single run ($5.00 minus the $0.005 start fee, at $0.001 per item) with no credit card. No subscription, no minimum monthly seat.
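The arithmetic behind those figures, as a sketch you can adapt to your own volumes. It assumes a single run, so one actor-start fee, and rounds half-up to the cent, which is what reproduces the table above; run_cost and items_within are illustrative names.

```python
from decimal import Decimal, ROUND_HALF_UP

ACTOR_START = Decimal("0.005")  # charged once per run
PER_RESULT = Decimal("0.001")   # charged per dataset item


def run_cost(items: int) -> Decimal:
    """Cost of one run in USD, rounded half-up to the cent."""
    total = ACTOR_START + PER_RESULT * items
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def items_within(budget: str) -> int:
    """Largest item count a single run can write within `budget` USD."""
    return int((Decimal(budget) - ACTOR_START) / PER_RESULT)


for n in (100, 1_000, 5_000, 10_000):
    print(n, run_cost(n))
```

Decimal is used instead of float because 0.105 is not exactly representable in binary floating point and round() would turn it into 0.1 rather than 0.11.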

The technically interesting bit

Reddit's .json endpoints return a nested data.children[] array where each child's data object carries both submission-specific fields (title, url, num_comments, over_18) and comment-specific fields (body, link_id, parent_id) — but never all of them populated at once. A naive parser that maps every key for every row produces a sparse, schema-inconsistent dataset.

We branch on each child's kind (t3 = submission, t1 = comment, the same prefixes that appear in the fullname IDs) before extraction and set the irrelevant fields to null in each branch. The result is a single flat 14-field schema that covers both item types cleanly, with no surprises when you GROUP BY kind in SQL.
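A stripped-down sketch of that branch, not the Actor's actual code. It assumes each listing child looks like {"kind": "t3" or "t1", "data": {...}}, that data.name carries the fullname ID, and that data.permalink is site-relative. flatten and BASE are illustrative names, and leaving url null for comments is an assumption.

```python
from datetime import datetime, timezone

BASE = "https://www.reddit.com"


def flatten(child: dict, username: str, scraped_at: str) -> dict:
    """Map one listing child to the flat 14-field row, branching on kind."""
    kind, data = child["kind"], child["data"]
    created = int(data["created_utc"])
    # Shared fields first; kind-specific fields start as None.
    row = {
        "kind": "",
        "id": data["name"],
        "username": username,
        "subreddit": data["subreddit"],
        "title": None,
        "body": None,
        "url": None,
        "permalink": BASE + data["permalink"],
        "score": data["score"],
        "num_comments": None,
        "over_18": None,
        "created_utc": created,
        "posted_at": datetime.fromtimestamp(created, tz=timezone.utc).isoformat(),
        "scraped_at": scraped_at,
    }
    if kind == "t3":    # submission
        row.update(kind="submission", title=data["title"], url=data["url"],
                   num_comments=data["num_comments"], over_18=data["over_18"])
    elif kind == "t1":  # comment
        row.update(kind="comment", body=data["body"])
    else:
        raise ValueError(f"unsupported kind: {kind!r}")
    return row
```

Building the full key set up front, then filling in only the branch-specific values, is what guarantees every row has the same 14 keys in the same order.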

Limitations 🚧

Hard cap at 1,000 items per sort per user. Reddit's listing endpoints paginate to a maximum of 1,000 items regardless of how many a user has posted, so multi-year accounts will be truncated. For deep history, the community-maintained Arctic Shift Reddit dataset (Pushshift successor) covers data this Actor cannot reach.
Deleted and removed posts are not visible. Reddit returns [deleted]/[removed] items as placeholder objects with no usable content — we skip them.
Private subreddits return no content. If a user's activity is predominantly in private communities, their visible history will appear sparse.
No reply-chain traversal. We surface the user's own posts and comments; we do not fetch the comment trees they replied to. That is a separate Actor.
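If you work around the 1,000-item cap by combining a sortBy=top run with a sortBy=new run, the two datasets will overlap. A minimal sketch of merging them on the fullname ID; merge_runs is an illustrative helper, not something the Actor ships:

```python
from typing import Iterable


def merge_runs(*runs: Iterable[dict]) -> list[dict]:
    """Merge rows from several runs, keeping the first row seen per fullname ID."""
    seen: dict[str, dict] = {}
    for run in runs:
        for row in run:
            # setdefault keeps the earlier row when an ID repeats
            seen.setdefault(row["id"], row)
    return list(seen.values())
```

Because scores drift between runs, the order of arguments decides which score survives for a duplicated item; pass the most recent run first if you want the freshest numbers.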

FAQ ❓

Is scraping Reddit user history legal?
Reddit's .json endpoints expose what any anonymous web visitor sees — no authentication, no account required. This Actor collects only public profile pages, does not circumvent any access control, and collects no private messages or private subreddit content. Always check your jurisdiction and use case. For academic research, consult your institution's IRB and the AoIR ethics guidelines.

Can I export to CSV, Google Sheets, or a warehouse?
Yes — export JSON, CSV, or Excel from the Apify Console after a run. You can also webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n, or pull it via the Apify API with your dataset ID.

Is there an official Reddit API I should use instead?
The Reddit API requires OAuth registration and imposes strict rate limits (100 queries per minute per OAuth client ID on the free tier at the time of writing). For low-volume personal research that is the right tool. For batch pipelines, multi-user pulls, or warehouse export, a hosted Actor with built-in proxy rotation and retry logic fits better. They are different jobs.

How is this different from RedditMetis?
RedditMetis is a browser-based visualisation tool — great for a single manual lookup, but with no API, batch mode, or export. This Actor is the headless complement: give it a list of usernames, get back a clean dataset you can query, chart, or feed into a model.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/reddit-user-scraper.

Free $5 trial credit, no credit card. Point it at your own username first — you'll have your full public posting history as structured JSON in under a minute. Got a use case I missed, or a field you wish it returned? Drop it in the comments.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈

DEV Community