How to scrape business contact details and verify the emails in one step

#python #api #automation #webdev

If you do cold outreach, you have hit this wall: you scrape a list of company sites, get a pile of emails, load them into your sequencer, and a third of them bounce. Your sender reputation drops, and now even the good ones land in spam. The usual fix is two tools: one to scrape, one to verify. Here is how to do both in a single pass so the list that comes out is already usable.

Why the two-step way leaks

A plain contact scraper returns whatever matches an email pattern on the page. On a real business site that means you also collect placeholders from the template (info@example.com), demo addresses, a Sentry error-tracking key that looks like an email (key@o0.ingest.sentry.io), git@github.com from a code snippet, and third-party addresses smuggled into a mailto:?cc= link. Then you pay a second service to verify the whole dirty pile, including the garbage. The waste compounds.

The better approach is to validate as you extract, and only keep what survives.

Step 1: extract cleanly, not greedily

The trick that removes most of the noise is to stop trusting raw HTML attributes. Real contact emails live in visible text or in JSON-LD structured data. Tracking params, image filenames, and dev keys live in attributes and scripts. So strip executable scripts (but keep JSON-LD blocks, which are data even though they sit in a script tag), take mailto: addresses only up to the ? (never the query string), and blank out the remaining attribute values before you run the email regex:

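import re

# drop <script> blocks, but keep JSON-LD, which is data rather than executable code
SCRIPT = re.compile(r'<script\b(?![^>]*application/ld\+json)[^>]*>.*?</script\s*>', re.I | re.S)
# blank href/src/content attribute VALUES so ?cc=, ?ref=, data-url junk can't match
ATTR = re.compile(r'(href|src|content|data-[\w-]+)\s*=\s*("[^"]*"|\'[^\']*\')', re.I)
EMAIL = re.compile(r"\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b")

def extract_emails(html):
    html_without_scripts = SCRIPT.sub(" ", html)
    clean = ATTR.sub(r"\1=''", html_without_scripts)
    return set(EMAIL.findall(clean))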
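
Because blanking href values also wipes mailto: links, the mailto: rule has to run on the untouched HTML. Here is a minimal sketch of that rule, assuming you only want the primary recipients and never anything after the ?. The helper name mailto_addresses is hypothetical, not part of any library.

```python
import re
from urllib.parse import unquote

# Capture the part of a mailto: href that comes BEFORE any "?" or "#",
# so cc=, bcc= and subject= values are never seen.
MAILTO = re.compile(r'href\s*=\s*["\']mailto:([^"\'?#]*)', re.I)

def mailto_addresses(html: str) -> set:
    """Hypothetical helper: primary mailto: recipients only."""
    found = set()
    for raw in MAILTO.findall(html):
        # a mailto: link may hold several comma-separated, percent-encoded recipients
        for addr in unquote(raw).split(","):
            addr = addr.strip()
            if "@" in addr:
                found.add(addr)
    return found
```

Run it on the original HTML and union the result with the regex matches from the cleaned text.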

Then drop known service domains (sentry.io, github.com, googleapis.com) and placeholder locals (name@, you@, jane.doe@). What is left is candidate contacts, not soup.
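
A rough sketch of that filter, assuming small hand-kept blocklists. The entries are just the examples named in this post plus example.com, which is reserved for documentation; the function name is_candidate is made up for illustration.

```python
# Illustrative blocklists only; extend them with what you see in your own crawls.
SERVICE_DOMAINS = {"sentry.io", "github.com", "googleapis.com", "example.com"}
PLACEHOLDER_LOCALS = {"name", "you", "jane.doe"}

def is_candidate(email: str) -> bool:
    """Return False for service addresses and template placeholders."""
    local, _, domain = email.lower().rpartition("@")
    if not local or local in PLACEHOLDER_LOCALS:
        return False
    # suffix match, so subdomains such as o0.ingest.sentry.io are caught too
    return not any(domain == d or domain.endswith("." + d) for d in SERVICE_DOMAINS)
```

The suffix match is the important part: an exact-match blocklist would let key@o0.ingest.sentry.io straight through.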

Step 2: verify each one without sending

You do not need to send an email or even open an SMTP connection to catch most bad addresses. Resolve the domain's MX records. No MX (and no fallback A record) means mail cannot be delivered. Flag disposable domains by suffix match, flag role inboxes (info@, sales@), then score each address 0 to 100 and assign a verdict. The MX check is the core of it:

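import dns.resolver
def deliverable(domain):
    try:
        mx = dns.resolver.resolve(domain, "MX")
        # a null MX (".") means the domain explicitly accepts no mail
        return any(str(r.exchange) not in (".", "") for r in mx)
    except dns.resolver.NoAnswer:
        try:  # domain exists but publishes no MX: fall back to the A record
            dns.resolver.resolve(domain, "A")
            return True
        except Exception:
            return False
    except Exception:
        return False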
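
The flags and the 0 to 100 score could be layered on top of the MX result roughly like this. Treat it as a sketch: the weights, the sample disposable domain, and the names score_address and Result are placeholders I made up for illustration, not the values any particular tool uses.

```python
from dataclasses import dataclass, field

# Placeholder data: one sample disposable domain, and the two role locals named above.
DISPOSABLE_SUFFIXES = ("mailinator.com",)
ROLE_LOCALS = {"info", "sales"}

@dataclass
class Result:
    email: str
    score: int
    verdict: str
    flags: list = field(default_factory=list)

def score_address(email: str, has_mail_route: bool) -> Result:
    """Score one address given the DNS result for its domain (weights are arbitrary)."""
    local, _, domain = email.lower().rpartition("@")
    if not has_mail_route:
        return Result(email, 0, "invalid", ["no_mx"])
    score, flags = 100, []
    if any(domain == s or domain.endswith("." + s) for s in DISPOSABLE_SUFFIXES):
        flags.append("disposable")
        score -= 60
    if local in ROLE_LOCALS:
        flags.append("role")
        score -= 20
    return Result(email, max(score, 0), "valid", flags)
```

Here has_mail_route would come from the MX check above, so the scoring step itself never touches the network.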

Cache the MX lookup per domain. A list of 5,000 emails often has only a few hundred unique domains, so you resolve each one once.
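
One way to get that per-domain cache is to wrap the check with functools.lru_cache from the standard library. This is a sketch under the assumption that the check takes a domain string and returns a bool; cached_by_domain is a hypothetical helper name.

```python
from functools import lru_cache
from typing import Callable

def cached_by_domain(check: Callable[[str], bool]) -> Callable[[str], bool]:
    """Wrap a per-domain check so each domain is looked up at most once per run."""
    cached = lru_cache(maxsize=None)(check)

    def check_email(email: str) -> bool:
        # normalise so Acme.com and acme.com. share one cache entry
        domain = email.rpartition("@")[2].lower().rstrip(".")
        return cached(domain)

    return check_email
```

Usage would look like is_deliverable = cached_by_domain(deliverable). Note that lru_cache stores return values, not exceptions, so a check that raises on a timeout is retried on the next call instead of being remembered as a failure.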

The state most tools get wrong

Two-state verification (valid or invalid) quietly corrupts your list. A DNS timeout is not proof an address is dead; it is proof you could not check right now. Keep a third state: unknown. Mark it, do not drop it, and never charge for it. The three buckets are valid (deliverable, keep), invalid (no MX, drop), and unknown (retry later). That third state is the difference between a tool that cleans your list and one that silently deletes real leads.
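
The three buckets can be sketched as a small classifier that separates "the DNS gave a definitive no" from "the lookup failed". To keep the example free of network calls, the lookup is passed in as a function. Verdict, NoMailRoute, and classify are names invented for this sketch. With dnspython, one plausible mapping is to raise NoMailRoute on dns.resolver.NXDOMAIN and let timeouts and other resolver errors propagate.

```python
from enum import Enum
from typing import Callable, Iterable

class Verdict(str, Enum):
    VALID = "valid"      # deliverable, keep
    INVALID = "invalid"  # definitive answer: no mail route, drop
    UNKNOWN = "unknown"  # could not check right now, retry later

class NoMailRoute(Exception):
    """Hypothetical: raised by lookup() when DNS definitively reports no MX and no A."""

def classify(domain: str, lookup: Callable[[str], Iterable[str]]) -> Verdict:
    try:
        hosts = list(lookup(domain))
    except NoMailRoute:
        return Verdict.INVALID
    except Exception:
        # timeout, SERVFAIL, network error: not evidence the address is dead
        return Verdict.UNKNOWN
    usable = [h for h in hosts if h not in (".", "")]
    return Verdict.VALID if usable else Verdict.INVALID
```

The only path to invalid is a definitive answer. Every other failure lands in unknown, which is what stops a flaky resolver from deleting real leads.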

What good output looks like

Run the combined pipeline on a real site and you get the deliverable contact surfaced on top, scored, with the noise already gone. For example, scanning basecamp.com returns jason@basecamp.com with a valid verdict and a 100 score, while the demo and service addresses that a naive scraper would have mixed in are dropped before they ever reach you. One pass, one clean row per company.

One honest caveat

There is no SMTP handshake here, and that is on purpose. From shared datacenter IPs, SMTP probes get greylisted or blocked and return unreliable answers, and the attempt can get your IP flagged. So this is DNS-tier confidence: it shows the domain is set up to receive mail, not that one specific person's mailbox exists. A well-formed but fictional demo address on a real domain can still pass. That is the honest ceiling of verify-without-sending, and it still removes the large majority of bounces.

The shortcut

If you would rather not wire up the crawler, the noise filters, the DNS cache, and the three-state logic, I packaged the whole find-and-verify pipeline into one tool. You feed it a list of business sites and it returns deliverable, scored contacts in one run, with validation on by default and failed fetches never billed. It is the Lead-Gen Pipeline Pro on Apify. Either way, validate before you send, and make sure your tool keeps that third state.

Top comments (2)

Hayrullah Kar • Jun 15

The data leaks between traditional scraping and post-verification are a silent budget killer, and highlighting the three-state logic (valid, invalid, unknown) is a game-changer. Most scrapers discard network timeouts, blindly throwing away perfectly good enterprise leads.

If I could add one tiny critique: while avoiding risky SMTP handshakes is smart, checking for "Catch-All" (Accept-All) domains at the DNS layer would make this bulletproof. If a domain accepts any fake username, dropping its score prevents domain-wide bounce risks.

Brilliant, efficient pipeline design!

Larry Johnson • Jun 15

Really appreciate this, and you are pointing right at the part that keeps me up. Catch-all is the honest ceiling here. The catch is that accept-all is an SMTP behavior, not something a domain advertises at the DNS layer, so the textbook way to detect it is exactly the RCPT-to-a-random-address probe I am trying to avoid from shared IPs. What I have landed on without the handshake is treating it as a confidence cap rather than a hard verdict: keep a list of known accept-all providers and patterns, and when a domain looks catch-all, drop the result into the risky band instead of valid so the buyer does not trust a clean-looking row blindly. Not bulletproof, you are right, but it keeps the honesty intact. If you have seen a reliable way to sniff catch-all short of an SMTP probe, I would genuinely love to hear it. That one is still an open problem for me.