How to Scrape Emails and Contacts From Any Website (No API Key)


Most "find emails on a website" tutorials reach for a paid API by the second paragraph. You don't need one. Email addresses, phone numbers and social links sit in the public HTML of almost every business site. The hard parts are knowing which pages to fetch, filtering the matches without drowning in false positives, and doing it politely enough that you don't get blocked. This post walks through how to build a no-API-key contact scraper, and where the same logic falls apart at scale.

Why you don't need an API key

A "contact enrichment API" is mostly doing three things on your behalf:

Fetching a handful of pages from the target domain.
Running regex/heuristics over the HTML.
De-duplicating and scoring the results.

All three are things you can do yourself with fetch and a parser. The API's real value is the database it has pre-crawled, plus deliverability verification. For finding the emails a company actually publishes on its own site, you're paying for steps you can run locally.

Step 1: hit the right pages first, not the homepage

The single biggest mistake is scraping only the homepage. Companies almost never put their real contact email on /. They put it on /contact, /about, /team, /imprint, /support, or a footer that links to those.

So the crawl order matters more than the crawl depth. A good heuristic: fetch the homepage, extract internal links, then prioritize any link whose URL or anchor text matches a contact-intent pattern.

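const CONTACT_HINTS = /contact|about|team|imprint|impressum|support|help|kontakt|nosotros|equipo/i;

// True when href (absolute or relative) resolves to the same host as baseUrl.
function sameHost(href, baseUrl) {
  try {
    return new URL(href, baseUrl).host === new URL(baseUrl).host;
  } catch {
    return false; // malformed href
  }
}

function rankLinks(links, baseUrl) {
  return links
    .filter(href => sameHost(href, baseUrl))
    .sort((a, b) => score(b, baseUrl) - score(a, baseUrl)); // contact-intent pages first
}

// Score the path only, so a hint word inside the hostname doesn't make every link match.
function score(href, baseUrl) {
  return CONTACT_HINTS.test(new URL(href, baseUrl).pathname) ? 10 : 1;
}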

Crawling 3-4 ranked pages beats crawling 30 random ones, both for hit-rate and for not hammering the server.
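
The ranking above assumes you already have a list of links. As a rough sketch of the step before it, the snippet below pulls same-host links out of the homepage HTML together with their anchor text, then scores both the URL path and the text. The names extractLinks and rankByIntent, the regex-based anchor matching, and the 10/5 weights are illustrative assumptions, not a fixed recipe; a real HTML parser would be more robust than a regex.

```javascript
// Sketch: collect same-host links with anchor text, then rank by contact intent.
// extractLinks and rankByIntent are hypothetical helper names.
const CONTACT_HINTS = /contact|about|team|imprint|impressum|support|help|kontakt|nosotros|equipo/i;
const ANCHOR_RE = /<a\b[^>]*?href\s*=\s*["']([^"']+)["'][^>]*>([\s\S]*?)<\/a>/gi;

function extractLinks(html, baseUrl) {
  const base = new URL(baseUrl);
  const seen = new Map(); // url -> { url, text }, keeps the first occurrence
  for (const [, href, inner] of html.matchAll(ANCHOR_RE)) {
    let url;
    try {
      url = new URL(href, base); // resolves relative hrefs against the page URL
    } catch {
      continue; // skip malformed hrefs
    }
    // Drop mailto:, tel:, javascript: and anything on another host.
    if (!/^https?:$/.test(url.protocol) || url.host !== base.host) continue;
    url.hash = "";
    const text = inner.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
    if (!seen.has(url.href)) seen.set(url.href, { url: url.href, text });
  }
  return [...seen.values()];
}

function rankByIntent(links, limit = 4) {
  // Assumed weights: a hint in the path counts more than a hint in the anchor text.
  const score = l =>
    (CONTACT_HINTS.test(new URL(l.url).pathname) ? 10 : 0) +
    (CONTACT_HINTS.test(l.text) ? 5 : 0);
  return [...links].sort((a, b) => score(b) - score(a)).slice(0, limit);
}
```

Calling rankByIntent(extractLinks(html, "https://example.org/")) returns at most four candidates, which matches the "3-4 ranked pages" budget.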

Step 2: extract without the false-positive swamp

The naive email regex matches a lot of junk: image filenames like logo@2x.png, Sentry/analytics keys, version strings. Tighten it and then filter.

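const EMAIL_RE = /[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}/gi;

function extractEmails(html) {
  // Lowercase first so Info@Acme.com and info@acme.com dedupe to one entry.
  const raw = (html.match(EMAIL_RE) || []).map(e => e.toLowerCase());
  return [...new Set(raw)]
    .filter(e => !/\.(png|jpe?g|gif|webp|svg)$/.test(e))      // image @2x assets
    .filter(e => !/^[0-9a-f]{8,}@/.test(e))                   // hashed analytics ids
    // Telemetry and placeholder domains; [@.] anchors the match so "mydomain.com" survives.
    .filter(e => !/[@.](sentry\.(com|io)|wixpress\.com|example\.com|domain\.com)$/.test(e));
}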

Don't forget obfuscated addresses. Many sites write hello [at] company [dot] com, expose the address only inside a mailto: link, or encode the @ as an HTML entity such as &#64;. Decode entities before you run the regex, and add a second pass for the [at]/[dot] pattern.
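
A minimal sketch of that decoding step, assuming only the handful of entities that matter for email addresses and only bracketed [at]/(at)/{at} forms. The helper names decodeEntities, deobfuscate and findObfuscatedEmails are made up for this example, and the entity table is deliberately incomplete.

```javascript
// Sketch: decode HTML entities, undo [at]/[dot] obfuscation, then run the email regex.
const EMAIL_RE = /[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}/gi;

// Only the named entities likely to appear inside an address (assumed subset).
const NAMED = new Map([
  ["amp", "&"], ["lt", "<"], ["gt", ">"], ["quot", '"'], ["apos", "'"],
  ["commat", "@"], ["period", "."],
]);

function fromCode(n, fallback) {
  // Guard against out-of-range code points, which would make fromCodePoint throw.
  return n >= 0 && n <= 0x10ffff ? String.fromCodePoint(n) : fallback;
}

function decodeEntities(html) {
  return html
    .replace(/&#x([0-9a-f]+);/gi, (m, hex) => fromCode(parseInt(hex, 16), m)) // &#x40;
    .replace(/&#(\d+);/g, (m, dec) => fromCode(Number(dec), m))               // &#64;
    .replace(/&([a-z]+);/g, (m, name) => NAMED.get(name) ?? m);               // &commat;
}

function deobfuscate(text) {
  // Bracketed forms only, so ordinary words like "at" and "dot" are left alone.
  return text
    .replace(/\s*[\[({]\s*at\s*[\])}]\s*/gi, "@")
    .replace(/\s*[\[({]\s*dot\s*[\])}]\s*/gi, ".");
}

function findObfuscatedEmails(html) {
  const text = deobfuscate(decodeEntities(html));
  const found = (text.match(EMAIL_RE) || []).map(e => e.toLowerCase());
  return [...new Set(found)];
}
```

The results still need the same denylist filtering as plain matches, since decoding only widens what the regex can see.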

Phones and socials are the same idea: a permissive regex plus a denylist. For socials, match linkedin.com/company/, twitter.com/, instagram.com/ etc. and strip share-intent URLs (/share?, /sharer).
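
To make the "permissive regex plus a denylist" idea concrete, here is a hedged sketch for phones and socials. The patterns, the 8 to 15 digit window, the list of social hosts and the function names extractPhones and extractSocials are all assumptions chosen for illustration; real phone formats vary by country and deserve a dedicated library if accuracy matters.

```javascript
// Sketch: permissive match first, then throw out the obvious junk.
const PHONE_RE = /\+?\d[\d ().\-]{6,18}\d/g;

function extractPhones(text) {
  const out = new Set();
  for (const raw of text.match(PHONE_RE) || []) {
    if (/^\d{4}-\d{2}-\d{2}$/.test(raw)) continue;   // ISO dates
    if (/^\d+(\.\d+){2,}$/.test(raw)) continue;      // version strings, IP addresses
    const digits = raw.replace(/\D/g, "");
    if (digits.length < 8 || digits.length > 15) continue; // assumed plausible range
    out.add((raw.startsWith("+") ? "+" : "") + digits);
  }
  return [...out];
}

const SOCIAL_RE =
  /https?:\/\/(?:www\.)?(?:linkedin\.com\/(?:company|in)|twitter\.com|x\.com|instagram\.com|facebook\.com)\/[A-Za-z0-9_.\-\/]+/gi;
const SHARE_RE = /\/(?:share|sharer|intent)\b/i; // share-intent URLs, not profiles

function extractSocials(html) {
  const urls = (html.match(SOCIAL_RE) || [])
    .map(u => u.replace(/[\/.]+$/, ""))   // trim trailing slashes and dots
    .filter(u => !SHARE_RE.test(u));
  return [...new Set(urls)];
}
```

Normalizing phones to digits before deduplication keeps "+1 (415) 555-0132" and "+1 415 555 0132" from showing up as two results.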

Step 3: be polite or get blocked

Set a real User-Agent. Default fetch agents get filtered.
Respect a small concurrency cap per host (2-3) and add jitter.
Honor robots.txt for the paths you crawl.
Cache by host so you don't refetch the homepage for every email you want.
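
Three of the rules above (User-Agent, per-host concurrency cap with jitter, caching) fit in one small wrapper. This is a sketch under assumptions: createPoliteFetcher is a made-up name, the User-Agent string and its URL are placeholders you would replace with your own, and the cache here is keyed by URL in memory only. It relies on the global fetch available in recent Node.js versions and does not cover robots.txt, which needs its own fetch-and-parse step.

```javascript
// Sketch: per-host concurrency cap, random jitter, real User-Agent, in-memory cache.
const USER_AGENT = "contact-crawler/0.1 (+https://example.com/bot)"; // placeholder value

function createPoliteFetcher({ fetchImpl = globalThis.fetch, perHost = 2, jitterMs = 500 } = {}) {
  const active = new Map();  // host -> number of slots currently held
  const waiting = new Map(); // host -> queue of resolvers waiting for a slot
  const cache = new Map();   // url -> Promise<string>
  const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

  async function acquire(host) {
    const held = active.get(host) || 0;
    if (held < perHost) {
      active.set(host, held + 1);
      return;
    }
    // No free slot: wait until release() hands one over.
    await new Promise(resolve => {
      if (!waiting.has(host)) waiting.set(host, []);
      waiting.get(host).push(resolve);
    });
  }

  function release(host) {
    const queue = waiting.get(host);
    if (queue && queue.length) queue.shift()(); // pass the slot straight to the next waiter
    else active.set(host, active.get(host) - 1);
  }

  function get(url) {
    if (cache.has(url)) return cache.get(url); // same URL is never fetched twice
    const host = new URL(url).host;
    const task = (async () => {
      await acquire(host);
      try {
        await sleep(Math.random() * jitterMs); // spread requests out a little
        const res = await fetchImpl(url, { headers: { "User-Agent": USER_AGENT } });
        if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
        return await res.text();
      } finally {
        release(host);
      }
    })();
    cache.set(url, task);
    task.catch(() => cache.delete(url)); // don't cache failures
    return task;
  }

  return { get };
}
```

Passing fetchImpl as an option is a design choice for testability: the limiter can be exercised with a fake fetch, without touching the network.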

Most "the scraper stopped working" reports are really "the scraper got rate-limited because it fired 50 parallel requests with a python-requests UA."

Where the DIY version breaks down

The script above is great for tens of sites. Past that you hit the operational tail:

Concurrency + proxies so one IP doesn't get blocked across thousands of domains.
Retries with backoff for the 10-15% of sites that flake on first request.
JS-rendered contact widgets (some sites inject the email via JavaScript), which need a headless browser, but only for those sites; running one for every site is wasteful.
Tech-stack detection if you want to segment leads (e.g. "all Shopify stores in this list").
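
Of the items above, retries with backoff is the one that is easy to sketch locally. The helper below is a hypothetical withRetry, not code from either hosted scraper: it retries a failing async task with exponential backoff and random jitter, and the defaults (3 retries, 500 ms base, 8 s cap) are arbitrary starting points.

```javascript
// Sketch: retry an async task with exponential backoff and full jitter.
async function withRetry(task, options = {}) {
  const {
    retries = 3,                    // extra attempts after the first one
    baseMs = 500,                   // delay cap for the first retry
    maxMs = 8000,                   // upper bound on any single delay
    shouldRetry = () => true,       // hook to skip retries for permanent errors
  } = options;
  const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === retries || !shouldRetry(err)) break;
      const cap = Math.min(maxMs, baseMs * 2 ** attempt); // 500, 1000, 2000, ...
      await sleep(Math.random() * cap); // random delay up to the cap
    }
  }
  throw lastError;
}
```

A call such as withRetry(() => fetchPage(url), { shouldRetry: e => !e.permanent }) would wrap a page fetch; fetchPage and the permanent flag are placeholders for whatever your own fetch layer exposes.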

That's exactly the boundary where I stopped maintaining a local script and moved the logic onto a hosted runner. I publish two of them as pay-per-result scrapers on Apify:

Email Scraper & Contact Finder — feed it a list of websites, it does the ranked-crawl + extraction + tech-stack detection described above and returns emails, phones and socials. No API key, no login.
Google Maps Email Extractor — when you don't even have the list of websites yet. Give it a search term + location, it pulls the local businesses from Google Maps (name, address, phone, website, rating) and then runs the contact crawl on each site to get the email. No Google API key.

Both run on Apify's free tier to try, and you pay per result rather than a monthly seat — which is the right shape for lead-gen work that's bursty rather than constant.

Takeaway

For finding the contact details a business publishes about itself, the API-key requirement is mostly artificial. Ranked crawling (contact pages first), a tightened regex with a denylist, entity/obfuscation decoding, and basic politeness get you most of the way. Reach for a hosted runner only when concurrency, proxy rotation and retries become the actual job — which is later than most tutorials would have you believe.