James Taylor

Posted on Jun 4 • Originally published at apify.com

How we built a Reddit comment-tree scraper that returns upvote scores — through a residential proxy

#webscraping #javascript #reddit #apify

Most "Reddit scrapers" quietly lie to you. They hand back a flat list of top-level comments with no upvote scores, no nesting, and no idea which reply was buried at the bottom of a 200-comment thread. That's because they're reading Reddit's RSS feed — the one endpoint Reddit still serves cheaply — and RSS throws away almost everything that makes a Reddit discussion interesting.

We needed the real thing: every comment, with its author, body, upvote score, depth, and parent, plus the post's score and upvote ratio. So we built it, published it on the Apify Store as Reddit Comment Tree Scraper, and this post walks through exactly how it works — the 403 wall, why a residential proxy is non-negotiable, and the one trick that keeps the cost sane.

Why Reddit is hard to scrape (and why RSS is a cop-out)

Reddit used to have a famously friendly JSON API: append .json to any thread URL and you'd get the whole tree. Then they locked it down. Today, if you fetch() a thread's .json from a server, you get a 403. It's gated on two things at once:

IP reputation. Datacenter IPs (AWS, GCP, Hetzner, the usual suspects) are blocked outright. A residential IP from a real ISP passes.
TLS / client fingerprint. Even from a residential IP, a plain HTTP client gets challenged. Reddit fingerprints the TLS handshake and headers and can tell a node-fetch from a real browser.

A datacenter IP + a real browser still 403s. A residential IP + curl still gets challenged. You need both: a residential IP and a real browser. That's the whole problem in one sentence, and it's why the cheap actors don't bother — they fall back to RSS, which is unauthenticated and gives you flat, scoreless comments.

If all you need is "what are the new posts in r/SaaS," RSS is fine (and we use it ourselves for cheap discovery — more on that below). But if you need the engagement data — which comment actually resonated, how deep the thread went, what the sentiment looked like at each level — RSS can't help you.

The approach: warm a real browser, then read the canonical JSON

Here's the core insight that makes the actor both reliable and affordable:

You don't need to render every page. You need a real browser to clear Reddit's gate once, and then you can fetch the lightweight .json from inside that same browser context as many times as you like.

So the flow is:

1. Spin up a headless Chromium through a residential proxy.
2. Navigate to old.reddit.com once. This clears the anti-bot gate and warms the session (cookies, fingerprint, the works).
3. From inside that warmed page, fetch() each thread's canonical .json. Because the request now originates from a real, gate-cleared browser context, Reddit serves it.
4. Parse the JSON into a clean post + comment tree.
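The first two steps can be sketched roughly as follows. This is an illustration, not the actor's actual helper: the function names are hypothetical, `chromium` is assumed to be Playwright's `chromium` object passed in by the caller, and the proxy URL format (`http://user:pass@host:port`) is an assumption about what your provider hands you.

```javascript
// Hypothetical helper: turn a proxy URL into the shape Playwright's
// `launch({ proxy })` option expects ({ server, username, password }).
function toPlaywrightProxy(proxyUrl) {
  const u = new URL(proxyUrl);
  return {
    server: `${u.protocol}//${u.host}`,
    // URL keeps credentials percent-encoded, so decode them.
    username: decodeURIComponent(u.username),
    password: decodeURIComponent(u.password),
  };
}

// Hypothetical sketch of a "warmed" session: one browser, one proxy IP,
// one navigation to old.reddit.com. `chromium` is injected by the caller,
// e.g. `const { chromium } = require('playwright')`.
async function openWarmedContext(chromium, proxyUrl) {
  const browser = await chromium.launch({
    headless: true,
    proxy: toPlaywrightProxy(proxyUrl),
  });
  const ctx = await browser.newContext();
  const page = await ctx.newPage();
  // A single page load; later .json fetches reuse this same page.
  await page.goto('https://old.reddit.com/', { waitUntil: 'domcontentloaded' });
  return { browser, ctx, page };
}
```

Passing `chromium` in as an argument keeps the sketch free of a hard dependency and makes it easy to swap in a stub. Whoever opens a session like this is also responsible for closing the browser when the session ends.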

The key line is the in-page fetch. We use Playwright's page.evaluate() to run the fetch in the browser's own JS context, so it inherits the warmed session:

const json = await page.evaluate(async (u) => {
  const r = await fetch(u, { headers: { Accept: 'application/json' } });
  if (!r.ok) return { __status: r.status };
  return await r.json();
}, jsonUrl);

That jsonUrl is just the thread URL with .json and ?limit=200&raw_json=1 tacked on. raw_json=1 stops Reddit from HTML-escaping the comment bodies, so you get clean text instead of &amp; soup.
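One way to build such a URL from whatever thread link the user pastes in is sketched below. The function name `toJsonUrl` is hypothetical, and dropping any existing query string and fragment is a choice made for this example, not something the actor is known to do.

```javascript
// Hypothetical helper: thread URL -> canonical .json URL with
// limit and raw_json set. Uses only the standard WHATWG URL API.
function toJsonUrl(threadUrl, limit = 200) {
  const u = new URL(threadUrl);
  // Strip trailing slashes, then add the .json suffix exactly once.
  const path = u.pathname.replace(/\/+$/, '');
  u.pathname = path.endsWith('.json') ? path : `${path}.json`;
  // Discard tracking params and fragments from the pasted link.
  u.search = '';
  u.hash = '';
  u.searchParams.set('limit', String(limit));
  u.searchParams.set('raw_json', '1');
  return u.toString();
}
```

Normalising the URL in one place means the in-page fetch never has to care whether the input had a trailing slash, a share parameter, or an existing .json suffix.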

Getting the whole tree, not just the first page

Reddit serves roughly the top 200 comments per thread and collapses the rest into "load more comments" stubs. If you stop there, you silently lose the deepest, often most candid replies.

Those stubs aren't dead ends — each one carries the IDs of the comments it's hiding. We collect those IDs and POST them to Reddit's /api/morechildren endpoint (again, from inside the warmed browser context), 100 at a time, until we hit the user's maxComments cap:

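const body = new URLSearchParams({
  link_id: linkId,        // fullname of the post: t3_<postId>
  children: children,     // comma-separated string of up to 100 comment IDs
  api_type: 'json',
  sort: 'confidence',
  raw_json: '1',          // same unescaped-body behaviour as the .json fetch
});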
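Collecting the hidden IDs and splitting them into batches might look like the sketch below. It relies on the shape of Reddit's thread JSON (a Listing whose children are `t1` comments or `more` stubs, with `replies` set to an empty string when a comment has none). The function names are hypothetical, and enforcing the maxComments cap is left out to keep the example short.

```javascript
// Walk a comment Listing recursively and gather every ID hidden
// behind a "more" stub, at any depth.
function collectMoreIds(listing, out = []) {
  for (const child of listing?.data?.children ?? []) {
    if (child.kind === 'more') {
      out.push(...(child.data.children ?? []));
    } else if (child.kind === 't1' && child.data.replies) {
      // `replies` is '' for leaf comments, so the truthiness check skips them.
      collectMoreIds(child.data.replies, out);
    }
  }
  return out;
}

// Split the IDs into batches of at most `size`.
function chunk(ids, size = 100) {
  const batches = [];
  for (let i = 0; i < ids.length; i += size) batches.push(ids.slice(i, i + size));
  return batches;
}

// Build one form body per batch, mirroring the parameters shown above.
function buildMoreChildrenBodies(linkId, ids) {
  return chunk(ids).map((batch) => new URLSearchParams({
    link_id: linkId,
    children: batch.join(','),
    api_type: 'json',
    sort: 'confidence',
    raw_json: '1',
  }));
}
```

Each returned body can then be sent from inside the warmed page, one request per batch.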

This is the difference between a scraper that returns "the 200 comments Reddit felt like showing" and one that returns the actual discussion. Each comment comes back with its depth and parentId, so you can rebuild the exact nesting — or just use the flat list with scores attached.
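As a rough illustration of where depth and parentId come from, the nested Listing in Reddit's thread JSON can be flattened with a short recursive walk. The function name is hypothetical and the output fields simply mirror the ones this post describes. Depth is counted during the walk rather than read from the payload.

```javascript
// Flatten a nested comment Listing into a flat, pre-order list.
// Each entry keeps enough information to rebuild the nesting later.
function flattenComments(listing, depth = 0, out = []) {
  for (const child of listing?.data?.children ?? []) {
    if (child.kind !== 't1') continue; // skip "more" stubs here
    const c = child.data;
    out.push({
      author: c.author,
      body: c.body,
      score: c.score,
      depth,
      parentId: c.parent_id, // t3_<postId> for top-level, t1_<commentId> for replies
    });
    // `replies` is '' when there are none.
    if (c.replies) flattenComments(c.replies, depth + 1, out);
  }
  return out;
}
```

Because the walk is pre-order, a reply always appears after its parent in the output, which keeps the flat list readable even without rebuilding the tree.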

The cost problem — and the trick that solves it

Residential proxy bandwidth is the floor on cost for any serious Reddit scrape. Apify's residential proxy runs about $8/GB. If you naively launched a fresh browser and a fresh proxy IP for every single thread, you'd pay for a full page render and a new IP rotation on every request. That gets expensive fast.

Two levers fix this:

1. Warm once per session, then batch. Each worker opens one session on a single proxy IP, clears the gate once, then fires up to threadsPerSession (default 15) thread-.json fetches through that same warmed context before rotating to a fresh IP. Browser startup and gate-clearing, the expensive parts, are amortised across 15 threads instead of being paid once per thread. After that, you're mostly paying for lightweight JSON payloads, not page renders.

async function worker() {
  while (threads.length) {
    const session = await openWarmedContext();   // one IP, gate cleared once
    let inSession = 0;
    while (threads.length && inSession < threadsPerSession) {
      const ref = threads.shift();
      await fetchThreadInPage(session.page, ref); // cheap JSON fetch
      inSession += 1;
    }
    await session.ctx.close();                    // rotate IP, repeat
  }
}

2. Bring your own residential proxy. This is the big one. The actor uses Apify's createProxyConfiguration, which transparently accepts a "Custom proxies" option in the proxy input. Paste your own residential proxy URLs (providers like IPRoyal sell residential bandwidth at $1–2/GB) and, at those rates, bandwidth costs roughly 4–8× less than Apify's residential proxy at about $8/GB, with zero code changes. The actor rotates your IPs per session exactly the same way.

That BYO-proxy support is deliberate. We run this actor inside our own product at high volume, and the proxy economics are the whole game at scale.
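To see how the two levers combine, here is a back-of-the-envelope cost model. The payload sizes in the usage comment are placeholders chosen purely for illustration, not measurements from the actor. Only the per-GB prices and the threadsPerSession default come from this post.

```javascript
// Rough bandwidth cost model. All sizes are in MB and are supplied by
// the caller; nothing here is measured from the real actor.
function bandwidthCost({ threads, pricePerGb, warmupMb, jsonMb, threadsPerSession }) {
  // The warm-up page load is shared by every thread in the session.
  const mbPerThread = warmupMb / threadsPerSession + jsonMb;
  const totalGb = (mbPerThread * threads) / 1000;
  return totalGb * pricePerGb;
}

// Usage with placeholder sizes (3 MB warm-up, 0.2 MB per thread JSON):
// const naive   = bandwidthCost({ threads: 1000, pricePerGb: 8, warmupMb: 3, jsonMb: 0.2, threadsPerSession: 1 });
// const batched = bandwidthCost({ threads: 1000, pricePerGb: 8, warmupMb: 3, jsonMb: 0.2, threadsPerSession: 15 });
// const byo     = bandwidthCost({ threads: 1000, pricePerGb: 2, warmupMb: 3, jsonMb: 0.2, threadsPerSession: 15 });
```

Plug in your own measured payload sizes. The point of the model is only that the warm-up cost is divided by threadsPerSession while the per-GB price multiplies everything.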

Reliability: requeue on a fresh IP

Residential IPs are flaky by nature — some are slow, some are already rate-limited by Reddit, some just die mid-session. The actor treats a blocked or stale fetch as retryable: a thread that fails gets pushed back onto the queue (up to 3 tries) and picked up by the next warmed session on a fresh IP. A thread that comes back valid-but-empty (deleted/removed post) is not retried — there's nothing there to get.

This is the difference between "works in a demo" and "works on 10,000 threads overnight." You assume IPs will fail and design the retry around it, rather than treating every failure as fatal.
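The requeue rule can be sketched in a few lines. This is a simplified model rather than the actor's code: the function name and the injected `fetchThread` callback are hypothetical, session and IP rotation are left out, and a failure is recognised by the `__status` marker used in the in-page fetch shown earlier in this post.

```javascript
// Process a list of threads, requeueing failures up to `maxTries` attempts.
// `fetchThread(ref)` is supplied by the caller and should resolve to either
// parsed JSON or an object like { __status: 403 } on a blocked fetch.
async function runWithRequeue(threads, fetchThread, maxTries = 3) {
  const queue = threads.map((ref) => ({ ref, tries: 0 }));
  const results = [];
  const failed = [];
  while (queue.length) {
    const job = queue.shift();
    job.tries += 1;
    let res;
    try {
      res = await fetchThread(job.ref);
    } catch (err) {
      res = { __status: 0 }; // treat thrown errors like a blocked fetch
    }
    if (res && res.__status === undefined) {
      // Valid response, including valid-but-empty threads: never retried.
      results.push({ ref: job.ref, data: res });
    } else if (job.tries < maxTries) {
      queue.push(job); // back of the queue, to be picked up later
    } else {
      failed.push(job.ref);
    }
  }
  return { results, failed };
}
```

Pushing a failed job to the back of the queue, rather than retrying it immediately, is what gives it a chance to land on a different session in a multi-worker setup.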

Discovery for free

One more economy: you don't need the expensive browser path just to find threads. Reddit's per-subreddit RSS listing is still served cheaply and unauthenticated. So when you give the actor a list of subreddits, it pulls the listing via plain RSS to discover thread IDs, and only spends the residential-browser budget on the actual deep scrape of each thread. Cheap where you can be, expensive only where you must be.
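A minimal sketch of that discovery step is below, assuming Node 18+ for the global fetch. Rather than depend on the exact feed structure, it pulls thread IDs out of any /comments/<id>/ links it finds in the feed text. The function names, the choice of the /new/ listing, and the User-Agent string are all assumptions made for the example.

```javascript
// Extract unique thread IDs from feed text by matching thread permalinks.
function extractThreadIds(feedXml) {
  const ids = new Set();
  const re = /\/r\/[^/\s"<]+\/comments\/([a-z0-9]+)/gi;
  for (const m of feedXml.matchAll(re)) ids.add(m[1].toLowerCase());
  return [...ids];
}

// Hypothetical discovery call: plain HTTP, no browser, no residential proxy.
async function discoverThreadIds(subreddit) {
  const url = `https://www.reddit.com/r/${encodeURIComponent(subreddit)}/new/.rss`;
  const res = await fetch(url, { headers: { 'User-Agent': 'thread-discovery-sketch/0.1' } });
  if (!res.ok) throw new Error(`RSS fetch failed with status ${res.status}`);
  return extractThreadIds(await res.text());
}
```

The resulting IDs are all the expensive path needs in order to build each thread's .json URL.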

What you get back

One clean record per thread:

{
  "type": "post",
  "subreddit": "SaaS",
  "title": "How we cut churn 30%",
  "score": 142,
  "upvoteRatio": 0.97,
  "numComments": 88,
  "comments": [
    {
      "author": "growth_greg",
      "body": "What did your onboarding look like before?",
      "score": 24,
      "depth": 0,
      "parentId": "t3_abc123"
    }
  ]
}

Every comment carries the score and the tree position. That's the data sentiment models, social-listening tools, and trend analysts actually need — and the data RSS-based scrapers structurally cannot give you.
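As a small usage example, here is one way to pull the headline engagement numbers out of a record shaped like the one above. The function name and the returned fields are hypothetical. Only the input field names come from the sample record.

```javascript
// Summarise one thread record: top-scoring comment, deepest nesting level,
// and how many comments were actually captured.
function summarizeThread(record) {
  const comments = record.comments ?? [];
  let topComment = null;
  let maxDepth = null;
  for (const c of comments) {
    if (topComment === null || c.score > topComment.score) topComment = c;
    if (maxDepth === null || c.depth > maxDepth) maxDepth = c.depth;
  }
  return {
    title: record.title,
    postScore: record.score,
    capturedComments: comments.length,
    topComment, // null when the thread has no comments
    maxDepth,   // null when the thread has no comments
  };
}
```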

Compliance note

The actor reads public Reddit data only. It never logs in, posts, votes, or messages. Use the data in line with Reddit's terms and whatever laws apply to you. We built it for research, analysis, and social listening — not for spamming subreddits.

Try it

The actor is live on the Apify Store: Reddit Comment Tree Scraper — Full Threads + Scores. Give it a subreddit or a list of thread URLs and you'll get back the full tree with scores. Drop in your own residential proxy to make it cheap at volume.

This scraper is one component of a much larger system. We use it inside SignalEngine — an autonomous outbound engine that turns Reddit (and other) conversations into qualified leads with AI-drafted, context-aware replies. If you'd rather have the conversations turned into pipeline automatically than wire up the data yourself, that's what the engine is for.
