<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: James Taylor</title>
    <description>The latest articles on DEV Community by James Taylor (@james_taylor_037c857e0299).</description>
    <link>https://dev.to/james_taylor_037c857e0299</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3968529%2F0bdcd4df-d862-471b-a5a1-549d0d27b334.png</url>
      <title>DEV Community: James Taylor</title>
      <link>https://dev.to/james_taylor_037c857e0299</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/james_taylor_037c857e0299"/>
    <language>en</language>
    <item>
      <title>How we built a hiring-intent lead finder using Google as the backend (no login, no ban risk)</title>
      <dc:creator>James Taylor</dc:creator>
      <pubDate>Fri, 05 Jun 2026 10:33:33 +0000</pubDate>
      <link>https://dev.to/james_taylor_037c857e0299/how-we-built-a-hiring-intent-lead-finder-using-google-as-the-backend-no-login-no-ban-risk-565l</link>
      <guid>https://dev.to/james_taylor_037c857e0299/how-we-built-a-hiring-intent-lead-finder-using-google-as-the-backend-no-login-no-ban-risk-565l</guid>
      <description>&lt;p&gt;&lt;em&gt;Job posts are the strongest B2B buying signal there is. Here's how we turned public Google search results into a hiring-intent lead finder — and the parsing traps that nearly sank it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A company advertising a &lt;em&gt;"Marketing Manager, London"&lt;/em&gt; is telling you three things at once: it has &lt;strong&gt;budget&lt;/strong&gt;, it has a &lt;strong&gt;gap right now&lt;/strong&gt;, and you know &lt;strong&gt;exactly what the gap is&lt;/strong&gt;. That's the strongest cold-outreach trigger in B2B — and it's sitting in public, on job boards, for free.&lt;/p&gt;

&lt;p&gt;So we built a small Apify actor that turns it into a lead list: give it roles + locations, get back one lead per hiring company with the role, the location, the job link, and a ready-to-paste opener. Here's how it works, and — more usefully — the three parsing traps that nearly made the output garbage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The core trick: don't scrape the job boards. Search them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Indeed, LinkedIn and Glassdoor all run serious anti-bot (Cloudflare, DataDome). Scraping them directly means residential proxies, headless browsers, and a constant cat-and-mouse you will eventually lose.&lt;/p&gt;

&lt;p&gt;You don't have to play. Google has &lt;em&gt;already&lt;/em&gt; crawled those postings. So instead of fetching &lt;code&gt;indeed.com&lt;/code&gt;, you ask Google:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Marketing Manager" "London" (site:indeed.com OR site:linkedin.com/jobs OR site:glassdoor.com)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the search-results HTML, parse the titles, done. No login, no cookie, no anti-bot wall on the boards themselves, so there is nothing of yours to get blocked. We send the Google request with &lt;code&gt;got-scraping&lt;/code&gt; through Apify's &lt;code&gt;GOOGLE_SERP&lt;/code&gt; proxy (it's HTTP-only: you request &lt;code&gt;http://www.google.com/search?...&lt;/code&gt; and the proxy does the TLS to Google), and fall back to Bing when Google returns nothing.&lt;/p&gt;
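
&lt;p&gt;As a rough illustration of that query, here is a minimal sketch of how the search URL could be assembled. The helper name &lt;code&gt;buildSearchUrl&lt;/code&gt; and its parameters are assumptions made for this example, not the actor's actual code; the plain &lt;code&gt;http://&lt;/code&gt; scheme follows the HTTP-only proxy note above.&lt;/p&gt;

```javascript
// Hypothetical helper: name and signature are illustrative only.
function buildSearchUrl(role, location, sites) {
  // Quote role and location so Google treats each as an exact phrase.
  const siteClause = sites.map((s) => `site:${s}`).join(' OR ');
  const q = `"${role}" "${location}" (${siteClause})`;
  // URLSearchParams handles the percent-encoding of quotes, spaces and colons.
  const params = new URLSearchParams({ q });
  // Plain http on purpose: the SERP proxy is described as HTTP-only.
  return `http://www.google.com/search?${params.toString()}`;
}

console.log(buildSearchUrl('Marketing Manager', 'London', ['indeed.com', 'glassdoor.com']));
```

&lt;p&gt;Taking the site list as an argument keeps the same function usable with board roots or with narrower paths.&lt;/p&gt;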

&lt;p&gt;That part took an afternoon. Then we ran it for real, and the output was junk. Here's why — and the fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap 1: &lt;code&gt;site:indeed.com&lt;/code&gt; returns category pages, not jobs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first live run for "Marketing Manager / Leeds" returned "companies" like &lt;strong&gt;Email Marketing Leeds&lt;/strong&gt; and &lt;strong&gt;Performance Marketing Leeds Ls10&lt;/strong&gt;. Those aren't businesses — they're Indeed's &lt;em&gt;category/listing&lt;/em&gt; pages (&lt;code&gt;indeed.com/q-email-marketing-l-leeds-jobs.html&lt;/code&gt;), which rank brilliantly for SEO and name no single employer.&lt;/p&gt;

&lt;p&gt;The fix is to target the &lt;strong&gt;posting path&lt;/strong&gt;, not the board root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BOARD_SITES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;indeed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;indeed.com/viewjob&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;linkedin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;linkedin.com/jobs/view&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;glassdoor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;glassdoor.com/job-listing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;site:linkedin.com/jobs/view "Marketing Manager" "London"&lt;/code&gt; returns &lt;em&gt;individual&lt;/em&gt; postings whose titles read cleanly — &lt;em&gt;"Marketing Manager - Spotify"&lt;/em&gt;, &lt;em&gt;"House of CB hiring Marketing Manager"&lt;/em&gt;. Same query against the board root returns the listing-page noise. One-line change, completely different output quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap 2: a Google login link that &lt;em&gt;looked&lt;/em&gt; like a job host&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;accounts.google.com/ServiceLogin?...continue=...site:indeed.com...&lt;/code&gt; URL slipped through and became a "lead." The bug: we were checking whether the job-host string appeared &lt;em&gt;anywhere&lt;/em&gt; in the URL, and the search query (with &lt;code&gt;site:indeed.com&lt;/code&gt; in it) was echoed inside the &lt;code&gt;continue=&lt;/code&gt; parameter.&lt;/p&gt;

&lt;p&gt;Fix: match on the parsed &lt;strong&gt;host&lt;/strong&gt;, not a substring of the whole URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;hostMatches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;hosts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;u&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hostPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;host&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pathname&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;hosts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;hostPath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;// linkedin.com/jobs/view&lt;/span&gt;
                    &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;host&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;h&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;host&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`.&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="c1"&gt;// indeed.com&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lesson that keeps recurring in scraping: parse the thing, don't substring-match the thing.&lt;/p&gt;
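
&lt;p&gt;One caveat: in &lt;code&gt;hostMatches&lt;/code&gt; as shown, the path-qualified branch still calls &lt;code&gt;hostPath.includes(h)&lt;/code&gt;, which is itself a substring test, so a URL on a lookalike domain such as &lt;code&gt;notlinkedin.com/jobs/view/1&lt;/code&gt; would pass. A stricter variant might compare the host and the path prefix separately. This is a sketch under that assumption; &lt;code&gt;strictHostMatches&lt;/code&gt; is a hypothetical name, and it also treats unparseable URLs as non-matches instead of throwing.&lt;/p&gt;

```javascript
// Sketch only: strictHostMatches is a hypothetical name, not the actor's function.
function strictHostMatches(url, targets) {
  let u;
  try {
    u = new URL(url);
  } catch {
    return false; // unparseable URLs are never leads
  }
  const host = u.hostname.toLowerCase();
  const path = u.pathname.toLowerCase();
  // Exact host or a true subdomain (note the leading dot).
  const onHost = (h) => host === h || host.endsWith(`.${h}`);
  return targets.some((t) => {
    const slash = t.indexOf('/');
    if (slash === -1) return onHost(t); // e.g. indeed.com
    if (!onHost(t.slice(0, slash))) return false; // host part must match first
    return path.startsWith(t.slice(slash)); // e.g. /jobs/view
  });
}

console.log(strictHostMatches('https://uk.linkedin.com/jobs/view/123', ['linkedin.com/jobs/view']));
console.log(strictHostMatches('https://notlinkedin.com/jobs/view/1', ['linkedin.com/jobs/view']));
```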

&lt;p&gt;&lt;strong&gt;Trap 3: Google's near-matches&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Searching for "Plumber" surfaced "Solar Installer" and "Cyber Security Architect" postings — Google helpfully returns loosely-related results, and our title parser dutifully extracted &lt;em&gt;those&lt;/em&gt; roles as companies.&lt;/p&gt;

&lt;p&gt;The fix is a &lt;strong&gt;relevance gate&lt;/strong&gt;: keep a posting only if its title actually contains the role you searched for.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;titleMatchesRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;[^&lt;/span&gt;&lt;span class="sr"&gt;a-z0-9&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Boolean&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;sig&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sharpened precision dramatically for named professional roles (marketing, sales, ops) — exactly the roles where "you're hiring for this, here's why you might not need to" is a killer opener.&lt;/p&gt;
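
&lt;p&gt;Because &lt;code&gt;t.includes(w)&lt;/code&gt; is a substring check, a role such as "Intern" would also match a title containing "International". If that matters for your roles, one option is to compare whole words instead. A minimal sketch, assuming the same tokenising rule as above; &lt;code&gt;tokenize&lt;/code&gt; and &lt;code&gt;titleHasRoleWord&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```javascript
// Split on anything that is not a lowercase letter or digit, dropping empties.
const tokenize = (s) => s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// Hypothetical whole-word variant of the relevance gate.
function titleHasRoleWord(title, role) {
  const titleWords = new Set(tokenize(title));
  const roleWords = tokenize(role);
  // Prefer words of four or more characters; fall back to all role words.
  const significant = roleWords.filter((w) => w.length >= 4);
  const wanted = significant.length ? significant : roleWords;
  return wanted.some((w) => titleWords.has(w));
}

console.log(titleHasRoleWord('International Sales Director - Acme', 'Intern'));
console.log(titleHasRoleWord('Marketing Intern - Acme', 'Intern'));
```

&lt;p&gt;The trade-off is that whole-word matching misses plurals ("Plumbers" for "Plumber"), so it may suit some role names better than others.&lt;/p&gt;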

&lt;p&gt;&lt;strong&gt;The honest part&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even after all that, company-name extraction from arbitrary job-board titles isn't perfect — Indeed titles especially are inconsistent. So every result carries the &lt;code&gt;jobUrl&lt;/code&gt;: one click verifies the company. We say so plainly in the docs rather than pretending the parse is flawless. LinkedIn and Glassdoor titles (&lt;code&gt;Company hiring Role&lt;/code&gt;) extract cleanest; Indeed adds breadth.&lt;/p&gt;

&lt;p&gt;Optional last step: flip on &lt;code&gt;findEmails&lt;/code&gt; and, for each distinctively named company, the actor finds a decision-maker from public LinkedIn results and looks up a verified work email using your own Prospeo key. We gate that to &lt;em&gt;distinctive&lt;/em&gt; company names: running an email lookup on a vague extracted name ("Delivery &amp;amp; Digital") just matches a random person at the wrong company, and a confidently wrong email is worse than none.&lt;/p&gt;

&lt;h2&gt;Try it&lt;/h2&gt;

&lt;p&gt;It's live on the Apify Store, pay-per-result: &lt;strong&gt;&lt;a href="https://apify.com/signalengine/hiring-intent-lead-finder" rel="noopener noreferrer"&gt;Hiring Intent Lead Finder&lt;/a&gt;&lt;/strong&gt;. Point it at a role + city and you'll get a graded list of companies with a live buying signal.&lt;/p&gt;

&lt;p&gt;It's one piece of a bigger thing we're building — &lt;a href="https://engine.signalsprint.io" rel="noopener noreferrer"&gt;SignalEngine&lt;/a&gt;, agentic outbound that discovers, enriches, and emails leads autonomously. The hiring finder is a taste of the discovery layer.&lt;/p&gt;

&lt;p&gt;If you'd rather find &lt;em&gt;which local businesses are leaking leads&lt;/em&gt; than who's hiring, we shipped a sibling actor for that too — &lt;a href="https://apify.com/signalengine/lead-readiness-auditor" rel="noopener noreferrer"&gt;Local Business Website Audit&lt;/a&gt; grades a homepage's lead-capture (contact form, click-to-call, chat, booking) and hands back the weak ones as a prospect list.&lt;/p&gt;

&lt;p&gt;Building these in public — next up is pushing them toward Apify Rising Stars. The recurring lesson across all of them: reaching the data is easy; the entire game is in how honestly you parse it.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>javascript</category>
      <category>api</category>
      <category>saas</category>
    </item>
    <item>
      <title>How we built a Reddit comment-tree scraper that returns upvote scores — through a residential proxy</title>
      <dc:creator>James Taylor</dc:creator>
      <pubDate>Thu, 04 Jun 2026 15:26:08 +0000</pubDate>
      <link>https://dev.to/james_taylor_037c857e0299/how-we-built-a-reddit-comment-tree-scraper-that-returns-upvote-scores-through-a-residential-proxy-565d</link>
      <guid>https://dev.to/james_taylor_037c857e0299/how-we-built-a-reddit-comment-tree-scraper-that-returns-upvote-scores-through-a-residential-proxy-565d</guid>
      <description>&lt;p&gt;Most "Reddit scrapers" quietly lie to you. They hand back a flat list of top-level comments with no upvote scores, no nesting, and no idea which reply was buried at the bottom of a 200-comment thread. That's because they're reading Reddit's RSS feed — the one endpoint Reddit still serves cheaply — and RSS throws away almost everything that makes a Reddit discussion &lt;em&gt;interesting&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We needed the real thing: every comment, with its &lt;strong&gt;author, body, upvote score, depth, and parent&lt;/strong&gt;, plus the post's score and upvote ratio. So we built it, published it on the Apify Store as &lt;a href="https://apify.com/signalengine/reddit-deep-comments" rel="noopener noreferrer"&gt;Reddit Comment Tree Scraper&lt;/a&gt;, and this post walks through exactly how it works — the 403 wall, why a residential proxy is non-negotiable, and the one trick that keeps the cost sane.&lt;/p&gt;

&lt;h2&gt;Why Reddit is hard to scrape (and why RSS is a cop-out)&lt;/h2&gt;

&lt;p&gt;Reddit used to have a famously friendly JSON API: append &lt;code&gt;.json&lt;/code&gt; to any thread URL and you'd get the whole tree. Then they locked it down. Today, if you &lt;code&gt;fetch()&lt;/code&gt; a thread's &lt;code&gt;.json&lt;/code&gt; from a server, you get a &lt;code&gt;403&lt;/code&gt;. It's gated on two things at once:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;IP reputation.&lt;/strong&gt; Datacenter IPs (AWS, GCP, Hetzner, the usual suspects) are blocked outright. A residential IP from a real ISP passes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLS / client fingerprint.&lt;/strong&gt; Even from a residential IP, a plain HTTP client gets challenged. Reddit fingerprints the TLS handshake and headers and can tell a &lt;code&gt;node-fetch&lt;/code&gt; from a real browser.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A datacenter IP + a real browser still &lt;code&gt;403&lt;/code&gt;s. A residential IP + &lt;code&gt;curl&lt;/code&gt; still gets challenged. You need &lt;strong&gt;both&lt;/strong&gt;: a residential IP &lt;em&gt;and&lt;/em&gt; a real browser. That's the whole problem in one sentence, and it's why the cheap actors don't bother — they fall back to RSS, which is unauthenticated and gives you flat, scoreless comments.&lt;/p&gt;

&lt;p&gt;If all you need is "what are the new posts in r/SaaS," RSS is fine (and we use it ourselves for cheap discovery — more on that below). But if you need the &lt;em&gt;engagement data&lt;/em&gt; — which comment actually resonated, how deep the thread went, what the sentiment looked like at each level — RSS can't help you.&lt;/p&gt;

&lt;h2&gt;The approach: warm a real browser, then read the canonical JSON&lt;/h2&gt;

&lt;p&gt;Here's the core insight that makes the actor both reliable &lt;em&gt;and&lt;/em&gt; affordable:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You don't need to &lt;em&gt;render&lt;/em&gt; every page. You need a real browser to &lt;strong&gt;clear Reddit's gate once&lt;/strong&gt;, and then you can fetch the lightweight &lt;code&gt;.json&lt;/code&gt; from inside that same browser context as many times as you like.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So the flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spin up a headless Chromium through a &lt;strong&gt;residential proxy&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Navigate to &lt;code&gt;old.reddit.com&lt;/code&gt; once — this clears the anti-bot gate and warms the session (cookies, fingerprint, the works).&lt;/li&gt;
&lt;li&gt;From inside that warmed page, &lt;code&gt;fetch()&lt;/code&gt; each thread's canonical &lt;code&gt;.json&lt;/code&gt;. Because the request now originates from a real, gate-cleared browser context, Reddit serves it.&lt;/li&gt;
&lt;li&gt;Parse the JSON into a clean post + comment tree.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key line is the in-page fetch. We use Playwright's &lt;code&gt;page.evaluate()&lt;/code&gt; to run the fetch &lt;em&gt;in the browser's own JS context&lt;/em&gt;, so it inherits the warmed session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;u&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Accept&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;__status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;jsonUrl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;jsonUrl&lt;/code&gt; is just the thread URL with &lt;code&gt;?limit=200&amp;amp;raw_json=1&lt;/code&gt; tacked on. &lt;code&gt;raw_json=1&lt;/code&gt; stops Reddit from HTML-escaping the comment bodies, so you get clean text instead of &lt;code&gt;&amp;amp;amp;&lt;/code&gt; soup.&lt;/p&gt;
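
&lt;p&gt;A minimal sketch of building that URL with the standard &lt;code&gt;URL&lt;/code&gt; API. The helper name &lt;code&gt;toJsonUrl&lt;/code&gt;, the trailing-slash handling, and the dropping of any existing query string are assumptions for illustration; it expects the human-facing thread URL, not one that already ends in &lt;code&gt;.json&lt;/code&gt;.&lt;/p&gt;

```javascript
// Hypothetical helper: turns a thread URL into its .json form.
function toJsonUrl(threadUrl, limit = 200) {
  const u = new URL(threadUrl);
  u.search = ''; // assumption: drop share or tracking parameters
  u.hash = '';
  // Remove trailing slashes before appending the .json suffix.
  u.pathname = u.pathname.replace(/\/+$/, '') + '.json';
  u.searchParams.set('limit', String(limit));
  u.searchParams.set('raw_json', '1'); // ask for unescaped comment bodies
  return u.toString();
}

console.log(toJsonUrl('https://old.reddit.com/r/SaaS/comments/abc123/how_we_cut_churn/'));
```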

&lt;h2&gt;Getting the &lt;em&gt;whole&lt;/em&gt; tree, not just the first page&lt;/h2&gt;

&lt;p&gt;Reddit serves roughly the top 200 comments per thread and collapses the rest into "load more comments" stubs. If you stop there, you silently lose the deepest, often most candid replies.&lt;/p&gt;

&lt;p&gt;Those stubs aren't dead ends — each one carries the IDs of the comments it's hiding. We collect those IDs and POST them to Reddit's &lt;code&gt;/api/morechildren&lt;/code&gt; endpoint (again, from inside the warmed browser context), 100 at a time, until we hit the user's &lt;code&gt;maxComments&lt;/code&gt; cap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URLSearchParams&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;link_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;linkId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// t3_&amp;lt;postId&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;children&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;children&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// up to 100 comment IDs&lt;/span&gt;
  &lt;span class="na"&gt;api_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;confidence&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;raw_json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
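
&lt;p&gt;To show where the &lt;code&gt;children&lt;/code&gt; value could come from, here is a sketch that walks the thread JSON, gathers the IDs held by &lt;code&gt;more&lt;/code&gt; stubs, and groups them into comma-separated batches of 100. It assumes Reddit's usual listing shape (&lt;code&gt;kind: 't1'&lt;/code&gt; for comments, &lt;code&gt;kind: 'more'&lt;/code&gt; for stubs); the helper names are hypothetical.&lt;/p&gt;

```javascript
// Recursively collect the comment IDs hidden behind "more" stubs.
function collectMoreIds(listing, out = []) {
  const children = listing?.data?.children ?? [];
  for (const child of children) {
    if (child.kind === 'more') {
      out.push(...child.data.children);
    } else if (child.kind === 't1') {
      const replies = child.data.replies;
      // Reddit sends an empty string when a comment has no replies.
      if (replies) collectMoreIds(replies, out);
    }
  }
  return out;
}

// Group IDs into comma-separated strings of at most `size` IDs each.
function toBatches(ids, size = 100) {
  const rest = [...ids]; // copy so the caller's array is left untouched
  const batches = [];
  while (rest.length) {
    batches.push(rest.splice(0, size).join(','));
  }
  return batches;
}
```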



&lt;p&gt;This is the difference between a scraper that returns "the 200 comments Reddit felt like showing" and one that returns the actual discussion. Each comment comes back with its &lt;code&gt;depth&lt;/code&gt; and &lt;code&gt;parentId&lt;/code&gt;, so you can rebuild the exact nesting — or just use the flat list with scores attached.&lt;/p&gt;
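
&lt;p&gt;A sketch of that flattening step, assuming the listing shape Reddit returns for comment trees. &lt;code&gt;flattenComments&lt;/code&gt; is a hypothetical name, and here depth is derived from the recursion level instead of being read from the payload.&lt;/p&gt;

```javascript
// Walk the nested listing and emit one flat record per comment.
function flattenComments(listing, depth = 0, out = []) {
  for (const child of listing?.data?.children ?? []) {
    if (child.kind !== 't1') continue; // "more" stubs are handled separately
    const c = child.data;
    out.push({
      author: c.author,
      body: c.body,
      score: c.score,
      depth,
      parentId: c.parent_id,
    });
    // An empty string means the comment has no replies.
    if (c.replies) flattenComments(c.replies, depth + 1, out);
  }
  return out;
}
```

&lt;p&gt;Keeping &lt;code&gt;depth&lt;/code&gt; and &lt;code&gt;parentId&lt;/code&gt; on every record is what lets a consumer choose between rebuilding the tree and working with the flat list.&lt;/p&gt;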

&lt;h2&gt;The cost problem, and the trick that solves it&lt;/h2&gt;

&lt;p&gt;Residential proxy bandwidth is the floor on cost for any serious Reddit scrape. Apify's residential proxy runs about &lt;strong&gt;$8/GB&lt;/strong&gt;. If you naively launched a fresh browser and a fresh proxy IP for every single thread, you'd pay for a full page render and a new IP rotation on every request. That gets expensive fast.&lt;/p&gt;

&lt;p&gt;Two levers fix this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Warm once per session, then batch.&lt;/strong&gt; Each worker opens &lt;em&gt;one&lt;/em&gt; proxy IP, clears the gate &lt;em&gt;once&lt;/em&gt;, then fires up to &lt;code&gt;threadsPerSession&lt;/code&gt; (default 15) thread-&lt;code&gt;.json&lt;/code&gt; fetches through that same warmed context before rotating to a fresh IP. Browser startup and gate-clearing — the expensive parts — get amortised across 15 threads instead of paid once per thread. After that, you're mostly paying for lightweight JSON payloads, not page renders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;openWarmedContext&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;   &lt;span class="c1"&gt;// one IP, gate cleared once&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;inSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;inSession&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;threadsPerSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetchThreadInPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// cheap JSON fetch&lt;/span&gt;
      &lt;span class="nx"&gt;inSession&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;                    &lt;span class="c1"&gt;// rotate IP, repeat&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Bring your own residential proxy.&lt;/strong&gt; This is the big one. The actor uses Apify's &lt;code&gt;createProxyConfiguration&lt;/code&gt;, which transparently accepts a &lt;strong&gt;"Custom proxies"&lt;/strong&gt; option in the proxy input. Paste your own residential proxy URLs (providers like IPRoyal sell residential bandwidth at &lt;strong&gt;$1–2/GB&lt;/strong&gt;) and, against the $8/GB quoted above, you're roughly &lt;strong&gt;4–8× cheaper&lt;/strong&gt; per GB, with zero code changes. The actor rotates your IPs per session exactly the same way.&lt;/p&gt;

&lt;p&gt;That BYO-proxy support is deliberate. We run this actor inside our own product at high volume, and the proxy economics are the whole game at scale.&lt;/p&gt;

&lt;h2&gt;Reliability: requeue on a fresh IP&lt;/h2&gt;

&lt;p&gt;Residential IPs are flaky by nature — some are slow, some are already rate-limited by Reddit, some just die mid-session. The actor treats a blocked or stale fetch as retryable: a thread that fails gets pushed back onto the queue (up to 3 tries) and picked up by the &lt;em&gt;next&lt;/em&gt; warmed session on a &lt;em&gt;fresh&lt;/em&gt; IP. A thread that comes back valid-but-empty (deleted/removed post) is not retried — there's nothing there to get.&lt;/p&gt;

&lt;p&gt;This is the difference between "works in a demo" and "works on 10,000 threads overnight." You assume IPs will fail and design the retry around it, rather than treating every failure as fatal.&lt;/p&gt;
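
&lt;p&gt;That retry policy can be reduced to a small decision function. A sketch, assuming three outcome labels (&lt;code&gt;'ok'&lt;/code&gt;, &lt;code&gt;'empty'&lt;/code&gt;, &lt;code&gt;'blocked'&lt;/code&gt;) that are invented for this example; the actor's real bookkeeping is not shown in the post.&lt;/p&gt;

```javascript
// Hypothetical policy helper; the outcome labels are made up for this sketch.
function nextStep(outcome, priorTries, maxTries = 3) {
  // Valid responses, including deleted or removed posts, are final.
  if (outcome === 'ok' || outcome === 'empty') return 'keep';
  const tries = priorTries + 1; // count the attempt that just failed
  return tries >= maxTries ? 'give-up' : 'requeue';
}

console.log(nextStep('blocked', 0));
console.log(nextStep('blocked', 2));
```

&lt;p&gt;A thread marked for requeue would go back on the shared queue so that a later session, on a different IP, picks it up.&lt;/p&gt;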

&lt;h2&gt;Discovery for free&lt;/h2&gt;

&lt;p&gt;One more economy: you don't need the expensive browser path just to &lt;em&gt;find&lt;/em&gt; threads. Reddit's per-subreddit RSS listing is still served cheaply and unauthenticated. So when you give the actor a list of &lt;code&gt;subreddits&lt;/code&gt;, it pulls the listing via plain RSS to discover thread IDs, and only spends the residential-browser budget on the actual deep scrape of each thread. Cheap where you can be, expensive only where you must be.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you get back
&lt;/h2&gt;

&lt;p&gt;One clean record per thread:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"post"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"subreddit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SaaS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"How we cut churn 30%"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"upvoteRatio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.97&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"numComments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;88&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"comments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"author"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"growth_greg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What did your onboarding look like before?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"parentId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t3_abc123"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every comment carries the score and the tree position. That's the data sentiment models, social-listening tools, and trend analysts actually need — and the data RSS-based scrapers structurally cannot give you.&lt;/p&gt;
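
&lt;p&gt;To show what the tree position buys you, here is a small sketch that nests the flat &lt;code&gt;comments&lt;/code&gt; array using &lt;code&gt;depth&lt;/code&gt; alone. It assumes the comments arrive in thread (depth-first) order, which the sample record does not confirm; &lt;code&gt;nestByDepth&lt;/code&gt; and the &lt;code&gt;replies&lt;/code&gt; field are names invented for this example. If the order differs, linking on &lt;code&gt;parentId&lt;/code&gt; would be the safer route.&lt;/p&gt;

```javascript
// Assumes `comments` is in depth-first thread order (an assumption of this
// sketch, not something the record format guarantees).
function nestByDepth(comments) {
    const roots = [];
    const stack = []; // stack[d] holds the most recent node seen at depth d

    for (const comment of comments) {
        const node = { ...comment, replies: [] };
        stack.length = comment.depth; // drop branches deeper than this one
        const parent = stack[comment.depth - 1];
        if (parent) {
            parent.replies.push(node);
        } else {
            roots.push(node); // depth 0, or no known parent
        }
        stack[comment.depth] = node;
    }
    return roots;
}

// Made-up comments in the shape of the record above.
const tree = nestByDepth([
    { author: 'growth_greg', score: 24, depth: 0 },
    { author: 'op_example', score: 11, depth: 1 },
    { author: 'lurker_example', score: 3, depth: 0 },
]);
console.log(JSON.stringify(tree, null, 2));
```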

&lt;h2&gt;
  
  
  Compliance note
&lt;/h2&gt;

&lt;p&gt;The actor reads &lt;strong&gt;public Reddit data only&lt;/strong&gt;. It never logs in, posts, votes, or messages. Use the data in line with Reddit's terms and whatever laws apply to you. We built it for research, analysis, and social listening — not for spamming subreddits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The actor is live on the Apify Store: &lt;strong&gt;&lt;a href="https://apify.com/signalengine/reddit-deep-comments" rel="noopener noreferrer"&gt;Reddit Comment Tree Scraper — Full Threads + Scores&lt;/a&gt;&lt;/strong&gt;. Give it a subreddit or a list of thread URLs and you'll get back the full tree with scores. Drop in your own residential proxy to make it cheap at volume.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This scraper is one component of a much larger system. We use it inside &lt;a href="https://engine.signalsprint.io" rel="noopener noreferrer"&gt;SignalEngine&lt;/a&gt; — an autonomous outbound engine that turns Reddit (and other) conversations into qualified leads with AI-drafted, context-aware replies. If you'd rather have the conversations turned into pipeline automatically than wire up the data yourself, that's what the engine is for.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>javascript</category>
      <category>reddit</category>
      <category>apify</category>
    </item>
  </channel>
</rss>
