DEV Community

James Taylor

How we built a hiring-intent lead finder using Google as the backend (no login, no ban risk)

Job posts are the strongest B2B buying signal there is. Here's how we turned public Google search results into a hiring-intent lead finder — and the parsing traps that nearly sank it.

A company advertising a "Marketing Manager, London" is telling you three things at once: it has budget, it has a gap right now, and you know exactly what the gap is. That's the strongest cold-outreach trigger in B2B — and it's sitting in public, on job boards, for free.

So we built a small Apify actor that turns it into a lead list: give it roles + locations, get back one lead per hiring company with the role, the location, the job link, and a ready-to-paste opener. Here's how it works, and — more usefully — the three parsing traps that nearly made the output garbage.

The core trick: don't scrape the job boards. Search them.

Indeed, LinkedIn and Glassdoor all run serious anti-bot protection (Cloudflare, DataDome). Scraping them directly means residential proxies, headless browsers, and a constant cat-and-mouse game you will eventually lose.

You don't have to play. Google has already crawled those postings. So instead of fetching indeed.com, you ask Google:

"Marketing Manager" "London" (site:indeed.com OR site:linkedin.com/jobs OR site:glassdoor.com)

Read the search-results HTML, parse the titles, done. No login, no cookie, no anti-bot wall on the boards themselves — nothing of yours to get blocked. We route the Google request through Apify's GOOGLE_SERP proxy (it's HTTP-only — you request http://www.google.com/search?... and the proxy does the TLS to Google) with got-scraping, and fall back to Bing on an empty result.
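
A minimal sketch of how that query and URL could be assembled. The helper names `quote`, `buildQuery` and `buildSearchUrl` are hypothetical and not taken from the actor's source; the only details drawn from the article are the query shape and the plain `http://` Google URL. The request itself (got-scraping through the proxy) is deliberately left out.

```javascript
// Hypothetical helpers: the names are illustrative, not the actor's real code.

// Wrap a phrase in double quotes so the search engine treats it as exact.
function quote(phrase) {
  return `"${phrase.trim()}"`;
}

// Build: "Role" "Location" (site:a OR site:b OR ...)
function buildQuery(role, location, sites) {
  const siteClause = sites.map((s) => `site:${s}`).join(' OR ');
  return `${quote(role)} ${quote(location)} (${siteClause})`;
}

// Build the search URL. The scheme is http:// on purpose: per the article,
// the SERP proxy takes a plain-HTTP request and handles TLS to Google itself.
function buildSearchUrl(query) {
  const params = new URLSearchParams({ q: query });
  return `http://www.google.com/search?${params.toString()}`;
}

const query = buildQuery('Marketing Manager', 'London', [
  'indeed.com',
  'linkedin.com/jobs',
  'glassdoor.com',
]);
console.log(query);
console.log(buildSearchUrl(query));
```

Using `URLSearchParams` rather than string concatenation means the quotes, parentheses and colons in the query are percent-encoded consistently.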

That part took an afternoon. Then we ran it for real, and the output was junk. Here's why — and the fixes.

Trap 1: site:indeed.com returns category pages, not jobs

The first live run for "Marketing Manager / Leeds" returned "companies" like Email Marketing Leeds and Performance Marketing Leeds Ls10. Those aren't businesses — they're Indeed's category/listing pages (indeed.com/q-email-marketing-l-leeds-jobs.html), which rank brilliantly for SEO and name no single employer.

The fix is to target the posting path, not the board root:

const BOARD_SITES = {
  indeed:    'indeed.com/viewjob',
  linkedin:  'linkedin.com/jobs/view',
  glassdoor: 'glassdoor.com/job-listing',
};

site:linkedin.com/jobs/view "Marketing Manager" "London" returns individual postings whose titles read cleanly: "Marketing Manager - Spotify", "House of CB hiring Marketing Manager". The same query against the board root returns the listing-page noise. It's a one-line change with completely different output quality.

Trap 2: a Google login link that looked like a job host

An accounts.google.com/ServiceLogin?...continue=...site:indeed.com... URL slipped through and became a "lead." The bug: we were checking whether the job-host string appeared anywhere in the URL, and the search query (with site:indeed.com in it) was echoed inside the continue= parameter.

Fix: match on the parsed host, not a substring of the whole URL.

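function hostMatches(url, hosts) {
  let u;
  try { u = new URL(url); } catch { return false; } // unparseable URL: not a job link
  const host = u.hostname.toLowerCase();
  const path = u.pathname.toLowerCase();
  return hosts.some(h => {
    // Split a target like 'linkedin.com/jobs/view' into host and path parts.
    const slash = h.indexOf('/');
    const hHost = slash === -1 ? h : h.slice(0, slash);
    const hPath = slash === -1 ? '' : h.slice(slash);
    // Host must be the target or a subdomain of it; path must start with the target path.
    return (host === hHost || host.endsWith(`.${hHost}`)) && path.startsWith(hPath);
  });
}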

Lesson that keeps recurring in scraping: parse the thing, don't substring-match the thing.
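
To see the difference concretely, here is a small self-contained comparison. Both URLs are made up for illustration (the login URL only mimics the shape described above, and the job id is invented), and `naiveMatch` / `parsedMatch` are hypothetical names for the buggy and fixed approaches, reduced to a host-only check.

```javascript
// Illustrative URL shaped like the one that slipped through: the search query,
// including "site:indeed.com", is echoed inside the continue= parameter.
const loginUrl =
  'https://accounts.google.com/ServiceLogin?continue=' +
  encodeURIComponent('https://www.google.com/search?q=site:indeed.com');
const postingUrl = 'https://uk.indeed.com/viewjob?jk=abc123'; // made-up job id

// The buggy approach: look for the host string anywhere in the URL.
function naiveMatch(url, host) {
  return url.toLowerCase().includes(host);
}

// The parsed approach: compare against the hostname only.
function parsedMatch(url, host) {
  let hostname;
  try {
    hostname = new URL(url).hostname.toLowerCase();
  } catch {
    return false; // not a parseable absolute URL
  }
  return hostname === host || hostname.endsWith(`.${host}`);
}

console.log(naiveMatch(loginUrl, 'indeed.com'));    // true (false positive)
console.log(parsedMatch(loginUrl, 'indeed.com'));   // false
console.log(parsedMatch(postingUrl, 'indeed.com')); // true
```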

Trap 3: Google's near-matches

Searching for "Plumber" surfaced "Solar Installer" and "Cyber Security Architect" postings — Google helpfully returns loosely-related results, and our title parser dutifully extracted those roles as companies.

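The fix is a relevance gate: keep a posting only if its title contains at least one significant word of the role you searched for (four or more characters, falling back to any word when the role has none that long).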

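export function titleMatchesRole(title, role) {
  const t = title.toLowerCase();
  // Split the role into lowercase alphanumeric words.
  const tokens = role.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  // Prefer words of 4+ characters; shorter words are weak evidence on their own.
  const sig = tokens.filter(w => w.length >= 4);
  // Keep the title if any of those words appears in it (substring match).
  return (sig.length ? sig : tokens).some(w => t.includes(w));
}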

This sharpened precision dramatically for named professional roles (marketing, sales, ops) — exactly the roles where "you're hiring for this, here's why you might not need to" is a killer opener.
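
The gate above is deliberately loose: it accepts a title if any one significant word appears as a substring, so "Office Manager" passes for "Marketing Manager" and "Nursery Assistant" passes for "Nurse". If that proves too permissive, one possible tightening is sketched below. This is a variation for illustration, not the actor's code, and `tokenize` and `titleMatchesRoleStrict` are hypothetical names.

```javascript
// A stricter variant (an assumption, not the article's gate): compare whole
// words, and require every significant role word to appear in the title.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

function titleMatchesRoleStrict(title, role) {
  const titleWords = new Set(tokenize(title));
  const roleWords = tokenize(role);
  if (roleWords.length === 0) return false; // nothing to match on
  const significant = roleWords.filter((w) => w.length >= 4);
  const required = significant.length ? significant : roleWords;
  return required.every((w) => titleWords.has(w));
}

console.log(titleMatchesRoleStrict('Senior Marketing Manager - Spotify', 'Marketing Manager')); // true
console.log(titleMatchesRoleStrict('Office Manager', 'Marketing Manager')); // false
console.log(titleMatchesRoleStrict('Nursery Assistant', 'Nurse')); // false
```

The trade-off is recall: whole-word matching rejects harmless variants such as "Plumbers" for "Plumber", so the loose gate may still be the better default when breadth matters more than precision.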

The honest part

Even after all that, company-name extraction from arbitrary job-board titles isn't perfect — Indeed titles especially are inconsistent. So every result carries the jobUrl: one click verifies the company. We say so plainly in the docs rather than pretending the parse is flawless. LinkedIn and Glassdoor titles (Company hiring Role) extract cleanest; Indeed adds breadth.
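
For illustration, here is a sketch of what extraction could look like for just the two title shapes quoted earlier ("Role - Company" and "Company hiring Role"). It is an assumption about the approach, not the actor's parser: `extractCompany` is a hypothetical name, and real titles come in many more shapes than this handles.

```javascript
// Hypothetical sketch: handles only the two title shapes quoted in the article.
function extractCompany(title, role) {
  const t = title.trim();
  const r = role.trim().toLowerCase();

  // Shape 1: "Company hiring Role"
  const hiring = /^(.+?)\s+hiring\s+(.+)$/i.exec(t);
  if (hiring && hiring[2].toLowerCase().includes(r)) {
    return hiring[1].trim();
  }

  // Shape 2: "Role - Company"
  const dash = /^(.+?)\s+-\s+(.+)$/.exec(t);
  if (dash && dash[1].toLowerCase().includes(r)) {
    return dash[2].trim();
  }

  // Unknown shape: return null rather than guess a company name.
  return null;
}

console.log(extractCompany('Marketing Manager - Spotify', 'Marketing Manager'));          // Spotify
console.log(extractCompany('House of CB hiring Marketing Manager', 'Marketing Manager')); // House of CB
console.log(extractCompany('Email Marketing Leeds', 'Marketing Manager'));                // null
```

Returning `null` for unrecognised shapes keeps the lead honest: the jobUrl is still there for a one-click check, and no guessed name is passed downstream.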

Optional last step: flip on findEmails and, for each distinctively named company, the actor finds a decision-maker from public LinkedIn results and looks up a verified work email via your own Prospeo key. We gate that to distinctive company names: running an email lookup on a vague extracted name ("Delivery & Digital") just matches a random person at the wrong company, and a confidently wrong email is worse than none.

Try it

It's live on the Apify Store, pay-per-result: Hiring Intent Lead Finder. Point it at a role + city and you'll get a graded list of companies with a live buying signal.

It's one piece of a bigger thing we're building — SignalEngine, agentic outbound that discovers, enriches, and emails leads autonomously. The hiring finder is a taste of the discovery layer.

If you'd rather find which local businesses are leaking leads than who's hiring, we shipped a sibling actor for that too — Local Business Website Audit grades a homepage's lead-capture (contact form, click-to-call, chat, booking) and hands back the weak ones as a prospect list.

Building these in public — next up is pushing them toward Apify Rising Stars. The recurring lesson across all of them: reaching the data is easy; the entire game is in how honestly you parse it.
