DEV Community: James Taylor

How we built a hiring-intent lead finder using Google as the backend (no login, no ban risk)

James Taylor — Fri, 05 Jun 2026 10:33:33 +0000

Job posts are the strongest B2B buying signal there is. Here's how we turned public Google search results into a hiring-intent lead finder — and the parsing traps that nearly sank it.

A company advertising a "Marketing Manager, London" is telling you three things at once: it has budget, it has a gap right now, and you know exactly what the gap is. That's the strongest cold-outreach trigger in B2B — and it's sitting in public, on job boards, for free.

So we built a small Apify actor that turns it into a lead list: give it roles + locations, get back one lead per hiring company with the role, the location, the job link, and a ready-to-paste opener. Here's how it works, and — more usefully — the three parsing traps that nearly made the output garbage.

The core trick: don't scrape the job boards. Search them.

Indeed, LinkedIn and Glassdoor all run serious anti-bot protection (Cloudflare, DataDome). Scraping them directly means residential proxies, headless browsers, and a constant cat-and-mouse game you will eventually lose.

You don't have to play. Google has already crawled those postings. So instead of fetching indeed.com, you ask Google:

"Marketing Manager" "London" (site:indeed.com OR site:linkedin.com/jobs OR site:glassdoor.com)

Read the search-results HTML, parse the titles, done. No login, no cookie, no anti-bot wall on the boards themselves — nothing of yours to get blocked. We route the Google request through Apify's GOOGLE_SERP proxy (it's HTTP-only — you request http://www.google.com/search?... and the proxy does the TLS to Google) with got-scraping, and fall back to Bing on an empty result.

That part took an afternoon. Then we ran it for real, and the output was junk. Here's why — and the fixes.

Trap 1: site:indeed.com returns category pages, not jobs

The first live run for "Marketing Manager / Leeds" returned "companies" like Email Marketing Leeds and Performance Marketing Leeds Ls10. Those aren't businesses — they're Indeed's category/listing pages (indeed.com/q-email-marketing-l-leeds-jobs.html), which rank brilliantly for SEO and name no single employer.

The fix is to target the posting path, not the board root:

const BOARD_SITES = {
  indeed:    'indeed.com/viewjob',
  linkedin:  'linkedin.com/jobs/view',
  glassdoor: 'glassdoor.com/job-listing',
};

site:linkedin.com/jobs/view "Marketing Manager" "London" returns individual postings whose titles read cleanly — "Marketing Manager - Spotify", "House of CB hiring Marketing Manager". Same query against the board root returns the listing-page noise. One-line change, completely different output quality.
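
A minimal sketch of how a query like that could be assembled from a role, a location and the posting paths above. buildQuery and buildSearchUrl are illustrative names, not the actor's actual code, and the URL shape simply mirrors the http://www.google.com/search?... request described earlier.

```javascript
// Posting paths from the article; everything else here is illustrative.
const BOARD_SITES = {
  indeed: 'indeed.com/viewjob',
  linkedin: 'linkedin.com/jobs/view',
  glassdoor: 'glassdoor.com/job-listing',
};

// Build one Google query: quoted role, quoted location, OR-joined site: filters.
function buildQuery(role, location, boards = Object.keys(BOARD_SITES)) {
  const sites = boards
    .filter((b) => BOARD_SITES[b]) // ignore unknown board names
    .map((b) => `site:${BOARD_SITES[b]}`)
    .join(' OR ');
  return `"${role}" "${location}" (${sites})`;
}

// Plain http:// on purpose: the article lets the proxy handle TLS to Google.
function buildSearchUrl(role, location, boards) {
  const params = new URLSearchParams({ q: buildQuery(role, location, boards) });
  return `http://www.google.com/search?${params}`;
}
```

Calling buildQuery('Marketing Manager', 'London', ['linkedin']) gives the single-board form of the query shown above; leaving the third argument out searches all three posting paths at once.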

Trap 2: a Google login link that looked like a job host

An accounts.google.com/ServiceLogin?...continue=...site:indeed.com... URL slipped through and became a "lead." The bug: we were checking whether the job-host string appeared anywhere in the URL, and the search query (with site:indeed.com in it) was echoed inside the continue= parameter.

Fix: match on the parsed host, not a substring of the whole URL.

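function hostMatches(url, hosts) {
  let u;
  try { u = new URL(url); } catch { return false; } // unparseable URL: not a job host
  const host = u.hostname.toLowerCase();
  const path = u.pathname.toLowerCase();
  return hosts.some(h => {
    const slash = h.indexOf('/');
    const hHost = slash === -1 ? h : h.slice(0, slash); // linkedin.com
    const hPath = slash === -1 ? '' : h.slice(slash);   // /jobs/view
    // Compare the host on a domain boundary, so notlinkedin.com never passes
    const hostOk = host === hHost || host.endsWith(`.${hHost}`);
    return hostOk && path.includes(hPath);
  });
}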

Lesson that keeps recurring in scraping: parse the thing, don't substring-match the thing.
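
To make the failure concrete, here is a small sketch comparing the two checks on a login URL. The URL is a made-up reconstruction of the kind described above (the real one was longer), and the variable names are illustrative.

```javascript
// Hypothetical login redirect that echoes the search query in continue=
const loginUrl =
  'https://accounts.google.com/ServiceLogin?continue=' +
  encodeURIComponent('https://www.google.com/search?q=site:indeed.com');

// Substring check: "indeed.com" appears inside the continue= parameter
const naiveMatch = loginUrl.includes('indeed.com');

// Parsed check: only the hostname is compared, on a domain boundary
const parsed = new URL(loginUrl);
const hostIsIndeed =
  parsed.hostname === 'indeed.com' || parsed.hostname.endsWith('.indeed.com');

console.log({ naiveMatch, hostname: parsed.hostname, hostIsIndeed });
```

The substring check says yes because the query string is part of the text; the parsed check says no because the hostname is accounts.google.com.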

Trap 3: Google's near-matches

Searching for "Plumber" surfaced "Solar Installer" and "Cyber Security Architect" postings — Google helpfully returns loosely-related results, and our title parser dutifully extracted those roles as companies.

The fix is a relevance gate: keep a posting only if its title actually contains the role you searched for.

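export function titleMatchesRole(title, role) {
  const t = title.toLowerCase();
  // Split the role into lowercase alphanumeric tokens: "Marketing Manager" -> ["marketing", "manager"]
  const tokens = role.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  // Prefer "significant" tokens (4+ chars) so short words don't match everything
  const sig = tokens.filter(w => w.length >= 4);
  // Keep the posting if the title contains at least one of those tokens
  return (sig.length ? sig : tokens).some(w => t.includes(w));
}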

This sharpened precision dramatically for named professional roles (marketing, sales, ops) — exactly the roles where "you're hiring for this, here's why you might not need to" is a killer opener.

The honest part

Even after all that, company-name extraction from arbitrary job-board titles isn't perfect — Indeed titles especially are inconsistent. So every result carries the jobUrl: one click verifies the company. We say so plainly in the docs rather than pretending the parse is flawless. LinkedIn and Glassdoor titles (Company hiring Role) extract cleanest; Indeed adds breadth.
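
For the two clean title shapes mentioned in this post ("Company hiring Role" and "Role - Company"), a company-name extractor might look roughly like this. It is a sketch under those two assumptions only; extractCompany is a hypothetical name, and real titles need more cases than this covers.

```javascript
// Returns a company name for the two title shapes quoted in the article,
// or null when the title does not fit either one.
function extractCompany(title, role) {
  const clean = title.trim();

  // "House of CB hiring Marketing Manager" -> "House of CB"
  const hiring = clean.match(/^(.+?)\s+(?:is\s+)?hiring\s+/i);
  if (hiring) return hiring[1].trim();

  // "Marketing Manager - Spotify" -> "Spotify", only if the left side is the role
  const parts = clean.split(/\s+[-–|]\s+/);
  if (parts.length >= 2 && parts[0].toLowerCase().includes(role.toLowerCase())) {
    return parts[1].trim();
  }

  return null; // unknown shape: better no name than a wrong one
}
```

Returning null for anything that doesn't fit keeps a doubtful parse from producing a confident-looking company name; the jobUrl is still there for a manual check.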

Optional last step: flip on findEmails and, for each distinctively-named company, it finds a decision-maker from public LinkedIn results and enriches a verified work email via your own Prospeo key. We gate that to distinctive company names — running an email lookup on a vague extracted name ("Delivery & Digital") just matches a random person at the wrong company, and a confidently-wrong email is worse than none.

Try it

It's live on the Apify Store, pay-per-result: Hiring Intent Lead Finder. Point it at a role + city and you'll get a graded list of companies with a live buying signal.

It's one piece of a bigger thing we're building — SignalEngine, agentic outbound that discovers, enriches, and emails leads autonomously. The hiring finder is a taste of the discovery layer.

If you'd rather find which local businesses are leaking leads than who's hiring, we shipped a sibling actor for that too — Local Business Website Audit grades a homepage's lead-capture (contact form, click-to-call, chat, booking) and hands back the weak ones as a prospect list.

Building these in public — next up is pushing them toward Apify Rising Stars. The recurring lesson across all of them: reaching the data is easy; the entire game is in how honestly you parse it.

How we built a Reddit comment-tree scraper that returns upvote scores — through a residential proxy

James Taylor — Thu, 04 Jun 2026 15:26:08 +0000

Most "Reddit scrapers" quietly lie to you. They hand back a flat list of top-level comments with no upvote scores, no nesting, and no idea which reply was buried at the bottom of a 200-comment thread. That's because they're reading Reddit's RSS feed — the one endpoint Reddit still serves cheaply — and RSS throws away almost everything that makes a Reddit discussion interesting.

We needed the real thing: every comment, with its author, body, upvote score, depth, and parent, plus the post's score and upvote ratio. So we built it, published it on the Apify Store as Reddit Comment Tree Scraper, and this post walks through exactly how it works — the 403 wall, why a residential proxy is non-negotiable, and the one trick that keeps the cost sane.

Why Reddit is hard to scrape (and why RSS is a cop-out)

Reddit used to have a famously friendly JSON API: append .json to any thread URL and you'd get the whole tree. Then they locked it down. Today, if you fetch() a thread's .json from a server, you get a 403. It's gated on two things at once:

IP reputation. Datacenter IPs (AWS, GCP, Hetzner, the usual suspects) are blocked outright. A residential IP from a real ISP passes.
TLS / client fingerprint. Even from a residential IP, a plain HTTP client gets challenged. Reddit fingerprints the TLS handshake and headers and can tell a node-fetch from a real browser.

A datacenter IP + a real browser still 403s. A residential IP + curl still gets challenged. You need both: a residential IP and a real browser. That's the whole problem in one sentence, and it's why the cheap actors don't bother — they fall back to RSS, which is unauthenticated and gives you flat, scoreless comments.

If all you need is "what are the new posts in r/SaaS," RSS is fine (and we use it ourselves for cheap discovery — more on that below). But if you need the engagement data — which comment actually resonated, how deep the thread went, what the sentiment looked like at each level — RSS can't help you.

The approach: warm a real browser, then read the canonical JSON

Here's the core insight that makes the actor both reliable and affordable:

You don't need to render every page. You need a real browser to clear Reddit's gate once, and then you can fetch the lightweight .json from inside that same browser context as many times as you like.

So the flow is:

1. Spin up a headless Chromium through a residential proxy.
2. Navigate to old.reddit.com once. This clears the anti-bot gate and warms the session (cookies, fingerprint, the works).
3. From inside that warmed page, fetch() each thread's canonical .json. Because the request now originates from a real, gate-cleared browser context, Reddit serves it.
4. Parse the JSON into a clean post + comment tree.

The key line is the in-page fetch. We use Playwright's page.evaluate() to run the fetch in the browser's own JS context, so it inherits the warmed session:

const json = await page.evaluate(async (u) => {
  const r = await fetch(u, { headers: { Accept: 'application/json' } });
  if (!r.ok) return { __status: r.status };
  return await r.json();
}, jsonUrl);

That jsonUrl is just the thread's .json URL with ?limit=200&raw_json=1 tacked on. raw_json=1 stops Reddit from HTML-escaping the comment bodies, so you get clean text instead of &amp; soup.
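
A small sketch of how that URL could be derived from an ordinary thread link. toJsonUrl is an illustrative helper, not the actor's code; it assumes the input is a normal thread permalink and drops any existing query string or fragment.

```javascript
// Turn a thread permalink into its .json URL with limit and raw_json set.
function toJsonUrl(threadUrl, limit = 200) {
  const u = new URL(threadUrl);
  // Drop trailing slashes and any existing .json suffix, then add .json once
  u.pathname = u.pathname.replace(/\/+$/, '').replace(/\.json$/, '') + '.json';
  u.search = ''; // discard tracking parameters and the like
  u.hash = '';
  u.searchParams.set('limit', String(limit));
  u.searchParams.set('raw_json', '1'); // ask for unescaped comment bodies
  return u.toString();
}
```

Going through the URL parser rather than string concatenation means a trailing slash, an existing ?utm_source=... or a link that already ends in .json all come out the same way.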

Getting the whole tree, not just the first page

Reddit serves roughly the top 200 comments per thread and collapses the rest into "load more comments" stubs. If you stop there, you silently lose the deepest, often most candid replies.

Those stubs aren't dead ends — each one carries the IDs of the comments it's hiding. We collect those IDs and POST them to Reddit's /api/morechildren endpoint (again, from inside the warmed browser context), 100 at a time, until we hit the user's maxComments cap:

const body = new URLSearchParams({
  link_id: linkId,        // t3_<postId>
  children: children,     // comma-separated string of up to 100 comment IDs
  api_type: 'json',
  sort: 'confidence',
  raw_json: '1',
});

This is the difference between a scraper that returns "the 200 comments Reddit felt like showing" and one that returns the actual discussion. Each comment comes back with its depth and parentId, so you can rebuild the exact nesting — or just use the flat list with scores attached.
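
The batching step can be sketched as a pure function that turns the collected IDs into one request body per batch. buildMoreChildrenBodies is a hypothetical helper; treating maxComments as a cap on the number of hidden IDs requested is a simplifying assumption, and the actor's own accounting may differ.

```javascript
// Split hidden comment IDs into /api/morechildren request bodies,
// batchSize IDs per body, never requesting more than maxComments IDs.
function buildMoreChildrenBodies(linkId, hiddenIds, maxComments, batchSize = 100) {
  const ids = hiddenIds.slice(0, Math.max(0, maxComments));
  const bodies = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    bodies.push(new URLSearchParams({
      link_id: linkId,                                  // t3_<postId>
      children: ids.slice(i, i + batchSize).join(','),  // comma-separated IDs
      api_type: 'json',
      sort: 'confidence',
      raw_json: '1',
    }));
  }
  return bodies;
}
```

Each returned body would then be POSTed from inside the warmed page, the same way the thread JSON is fetched.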

The cost problem — and the trick that solves it

Residential proxy bandwidth is the floor on cost for any serious Reddit scrape. Apify's residential proxy runs about $8/GB. If you naively launched a fresh browser and a fresh proxy IP for every single thread, you'd pay for a full page render and a new IP rotation on every request. That gets expensive fast.

Two levers fix this:

1. Warm once per session, then batch. Each worker opens one proxy IP, clears the gate once, then fires up to threadsPerSession (default 15) thread-.json fetches through that same warmed context before rotating to a fresh IP. Browser startup and gate-clearing — the expensive parts — get amortised across 15 threads instead of paid once per thread. After that, you're mostly paying for lightweight JSON payloads, not page renders.

async function worker() {
  while (threads.length) {
    const session = await openWarmedContext();   // one IP, gate cleared once
    let inSession = 0;
    while (threads.length && inSession < threadsPerSession) {
      const ref = threads.shift();
      await fetchThreadInPage(session.page, ref); // cheap JSON fetch
      inSession += 1;
    }
    await session.ctx.close();                    // rotate IP, repeat
  }
}

2. Bring your own residential proxy. This is the big one. The actor uses Apify's createProxyConfiguration, which transparently accepts a "Custom proxies" option in the proxy input. Paste your own residential proxy URLs (providers like IPRoyal sell residential bandwidth at $1–2/GB) and, against the roughly $8/GB figure above, you're paying about 4–8× less per GB, with zero code changes. The actor rotates your IPs per session exactly the same way.

That BYO-proxy support is deliberate. We run this actor inside our own product at high volume, and the proxy economics are the whole game at scale.

Reliability: requeue on a fresh IP

Residential IPs are flaky by nature — some are slow, some are already rate-limited by Reddit, some just die mid-session. The actor treats a blocked or stale fetch as retryable: a thread that fails gets pushed back onto the queue (up to 3 tries) and picked up by the next warmed session on a fresh IP. A thread that comes back valid-but-empty (deleted/removed post) is not retried — there's nothing there to get.
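
The requeue rule can be sketched independently of the browser. In this sketch fetchThread is a hypothetical async function that reports 'ok', 'empty' or 'blocked'; the status names and drainQueue itself are assumptions for illustration, and session rotation is left out.

```javascript
const MAX_TRIES = 3;

// Work through a queue of thread refs; blocked fetches go to the back of
// the queue until they have used MAX_TRIES attempts, empty ones are dropped.
async function drainQueue(threadRefs, fetchThread) {
  const queue = threadRefs.map((ref) => ({ ref, tries: 0 }));
  const results = [];
  const failed = [];
  while (queue.length) {
    const job = queue.shift();
    job.tries += 1;
    let outcome;
    try {
      outcome = await fetchThread(job.ref);
    } catch {
      outcome = { status: 'blocked' }; // thrown errors count as retryable
    }
    if (outcome.status === 'ok') results.push(outcome.data);
    else if (outcome.status === 'empty') continue;   // deleted/removed: nothing to get
    else if (job.tries < MAX_TRIES) queue.push(job); // retry later, ideally on a fresh IP
    else failed.push(job.ref);
  }
  return { results, failed };
}
```

Pushing a failed job to the back of the queue, rather than retrying it immediately, is what gives it a chance of landing in a different session.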

This is the difference between "works in a demo" and "works on 10,000 threads overnight." You assume IPs will fail and design the retry around it, rather than treating every failure as fatal.

Discovery for free

One more economy: you don't need the expensive browser path just to find threads. Reddit's per-subreddit RSS listing is still served cheaply and unauthenticated. So when you give the actor a list of subreddits, it pulls the listing via plain RSS to discover thread IDs, and only spends the residential-browser budget on the actual deep scrape of each thread. Cheap where you can be, expensive only where you must be.
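
As a rough sketch of the discovery step, thread IDs can be pulled out of the feed text by matching thread permalinks. This assumes the feed contains links of the form reddit.com/r/<subreddit>/comments/<id>/...; it uses a regex rather than a real XML parser, and extractThreadRefs is an illustrative name.

```javascript
// Collect unique thread refs from feed text by matching thread permalinks.
function extractThreadRefs(feedXml) {
  const re = /https?:\/\/(?:www\.|old\.)?reddit\.com\/r\/([A-Za-z0-9_]+)\/comments\/([a-z0-9]+)\//g;
  const seen = new Map();
  for (const m of feedXml.matchAll(re)) {
    const [url, subreddit, id] = m;
    if (!seen.has(id)) seen.set(id, { id, subreddit, url }); // first link wins
  }
  return [...seen.values()];
}
```

Deduplicating by thread ID matters because the same permalink can appear more than once per feed entry.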

What you get back

One clean record per thread:

{
  "type": "post",
  "subreddit": "SaaS",
  "title": "How we cut churn 30%",
  "score": 142,
  "upvoteRatio": 0.97,
  "numComments": 88,
  "comments": [
    {
      "author": "growth_greg",
      "body": "What did your onboarding look like before?",
      "score": 24,
      "depth": 0,
      "parentId": "t3_abc123"
    }
  ]
}

Every comment carries the score and the tree position. That's the data sentiment models, social-listening tools, and trend analysts actually need — and the data RSS-based scrapers structurally cannot give you.
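
If you want the nesting back rather than the flat list, a sketch like the following would do it. It assumes each comment also carries an id in the same fullname form as parentId (for example t1_xyz); that field is not shown in the sample record above, so treat it as an assumption. buildTree is an illustrative name.

```javascript
// Rebuild the reply tree from a flat comment list using id and parentId.
function buildTree(comments) {
  // Copy each comment and give it an empty replies array
  const byId = new Map(comments.map((c) => [c.id, { ...c, replies: [] }]));
  const roots = [];
  for (const node of byId.values()) {
    const parent = byId.get(node.parentId);
    if (parent) parent.replies.push(node);
    else roots.push(node); // parent is the post (t3_...) or was never fetched
  }
  return roots;
}
```

Comments whose parent was never fetched end up as roots instead of being dropped, so a partial scrape still yields a usable tree.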

Compliance note

The actor reads public Reddit data only. It never logs in, posts, votes, or messages. Use the data in line with Reddit's terms and whatever laws apply to you. We built it for research, analysis, and social listening — not for spamming subreddits.

Try it

The actor is live on the Apify Store: Reddit Comment Tree Scraper — Full Threads + Scores. Give it a subreddit or a list of thread URLs and you'll get back the full tree with scores. Drop in your own residential proxy to make it cheap at volume.

This scraper is one component of a much larger system. We use it inside SignalEngine — an autonomous outbound engine that turns Reddit (and other) conversations into qualified leads with AI-drafted, context-aware replies. If you'd rather have the conversations turned into pipeline automatically than wire up the data yourself, that's what the engine is for.