My Google Maps scraper has been live for 2 weeks. Half the emails were bounce-bait — here's what I added.

📝 Follow-up to "I Built a Google Maps Email Scraper That Finds 74% More Emails Than the Competition" (Apr 16, 2026). Two weeks later, I shipped inline email validation (MX/SPF/DMARC) — and it turns out half of the emails the scraper had been finding wouldn't have delivered. Here's what I added and what I learned.


TL;DR

I built a Google Maps scraper on Apify that validates every email it finds inline (MX records + SPF + DMARC + catch-all detection). On a 168-record Austin dentists run, only 42 of the 89 emails it found were graded "high deliverability" — meaning roughly half of the scraped emails would have hurt my domain reputation if I'd sent cold email to them unchecked.

Bigger surprise: in EU markets, the email hit rate jumps from ~5% to 85% when you add localized contact-page paths (kapcsolat, impressum, contacto, contatti, contactez-nous, kontakty). Most scrapers don't.

Code patterns + actor link below.


The problem

50+ Google Maps scrapers exist on Apify alone. Most do the same pipeline:

Search Google Maps → harvest place URLs → visit each website → regex-grep emails → return JSON

Output looks fine. But ~50% of those emails bounce or land in spam when you actually send to them. Why?

  • Typos in the website itself (info@compant.com)
  • Dead domains (MX returns NXDOMAIN)
  • Catch-all servers (accept any RCPT TO, then bounce silently)
  • No SPF/DMARC at the receiver — your sender reputation gets clobbered

And in non-English markets, the scraper often returns no email at all because the contact page is at /kapcsolat or /impressum, not /contact.

Two fixes:

  1. Inline email validation (MX/SPF/DMARC + catch-all)
  2. Multilingual contact-page crawl

Inline email validation, in code

Five-layer probe per email:

const dns = require('dns/promises');

async function validateEmail(email) {
  const [, domain] = email.split('@');
  const result = {
    mxRecords: 0, hasSpf: false, hasDmarc: false,
    smtpValid: null, isCatchAll: null, deliverability: 'unknown',
  };

  // 1. MX records — does the domain accept mail?
  try {
    const mx = await dns.resolveMx(domain);
    result.mxRecords = mx.length;
  } catch { result.mxRecords = 0; }

  if (result.mxRecords === 0) {
    result.deliverability = 'low';
    return result;
  }

  // 2-3. SPF + DMARC TXT lookups
  try {
    const txt = await dns.resolveTxt(domain);
    result.hasSpf = txt.some(arr => arr.join('').toLowerCase().startsWith('v=spf1'));
  } catch {}
  try {
    const dmarcTxt = await dns.resolveTxt(`_dmarc.${domain}`);
    result.hasDmarc = dmarcTxt.some(arr => arr.join('').toLowerCase().startsWith('v=dmarc1'));
  } catch {}

  // 4. SMTP RCPT TO probe (optional, often blocked by Gmail/Outlook)
  // ... skipped for brevity, see full code

  // 5. Roll up to grade
  if (result.mxRecords > 0 && result.hasSpf && result.hasDmarc) {
    result.deliverability = 'high';
  } else if (result.mxRecords > 0) {
    result.deliverability = 'medium';
  }

  return result;
}

Per-domain DNS probing takes ~50ms. For 100 emails, you spend ~5 seconds total. Caching by domain makes this cheaper across batches.
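Here's a minimal sketch of that per-domain cache (the validateEmailCached wrapper is illustrative, not lifted from the actor's source):

// Memoize validation results per domain so repeated emails at the same
// company only trigger one set of DNS lookups. Storing the promise also
// lets concurrent lookups for the same domain coalesce.
const domainCache = new Map();

function validateEmailCached(email) {
  const domain = email.split('@')[1];
  if (!domainCache.has(domain)) {
    domainCache.set(domain, validateEmail(email));
  }
  return domainCache.get(domain);
}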

Compare this to paid validators:

  • ZeroBounce: $0.007/email
  • NeverBounce: $0.008/email
  • Bouncer: $0.004/email
  • Kickbox: $0.01/email

For 1,000 leads, that's $4-$10 you don't need to spend if validation is built into the scraper.

Multilingual contact-page crawl, in code

The actual content of the URL frontier:

const CONTACT_PATHS = [
  // English
  '/contact', '/contact-us', '/about', '/about-us',

  // Hungarian
  '/kapcsolat', '/elerhetoseg',

  // German (Impressum is legally required)
  '/kontakt', '/impressum', '/ansprechpartner',

  // Spanish
  '/contacto', '/contactar', '/contactenos',

  // Italian
  '/contatti', '/contattaci',

  // French
  '/contactez-nous', '/nous-contacter',

  // Polish
  '/kontakt-z-nami', '/kontakty',

  // Czech / Slovak ('/kontakt' is already covered under German)
  '/kontaktujte-nas',

  // Portuguese (BR + PT; '/contacto' is already covered under Spanish)
  '/contato', '/contatos',

  // Dutch
  '/over-ons',
];

async function crawlForEmails(baseUrl) {
  const emails = new Set();
  // Loose pattern: good enough for harvesting, validation happens later.
  const re = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi;
  for (const path of CONTACT_PATHS) {
    try {
      // Global fetch needs Node 18+ (or a fetch polyfill).
      const html = await fetch(new URL(path, baseUrl)).then(r => r.text());
      (html.match(re) || []).forEach(e => emails.add(e.toLowerCase()));
    } catch {} // 404s and timeouts are expected for most paths
  }
  return [...emails];
}

In a 20-record Berlin Mitte sample (Zahnarzt Berlin Mitte), this single change moved the email hit rate from 5% to 85%. Why? Because the German Impressum page is legally required to disclose owner name + email, so it's almost always present and almost always contains a real human's email.
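Gluing the two pieces together looks roughly like this (reusing the validateEmailCached wrapper from above; the lead/output shape is illustrative, not the actor's exact schema):

// Crawl a lead's website, validate whatever turns up, and keep only
// addresses whose domain at least resolves MX.
async function enrichLead(lead) {
  const found = await crawlForEmails(lead.website);
  const emails = [];
  for (const email of found) {
    const check = await validateEmailCached(email);
    if (check.mxRecords > 0) emails.push({ email, ...check });
  }
  return { ...lead, emails };
}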

Real run stats

I ran the actor with geoGridTiles: 3 (a 3×3 viewport grid over Austin, TX) and maxResults: 150. Cross-tile dedup filtered 43 duplicates. Final dataset: 168 unique leads.
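For reference, the input looked roughly like this (geoGridTiles and maxResults are the real knobs; the other field names are just what my actor's input schema happens to call them, so treat them as illustrative):

const input = {
  searchQuery: 'dentist Austin Texas',
  geoGridTiles: 3,      // 3×3 viewport grid over the target area
  maxResults: 150,
  validateEmails: true, // run the MX/SPF/DMARC checks inline
};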

Metric               Count       Rate
With email           89 / 168    53%
With phone           168 / 168   100%
With website         162 / 168   96%
Email graded high    42 / 89     47% of emails
Lead readiness hot   139 / 168   83%
Modern websites      117 / 168   70%
Total cost           $0.84
Runtime              22 min

Markets outside the US gave even better numbers, with Germany's Impressum requirement driving the biggest jump:

Market              Sample   Email Hit Rate   High Deliverability
Austin TX (US)      168      53%              25%
Manhattan (US)      25       64%              32%
Shoreditch (UK)     20       70%              40%
Berlin Mitte (DE)   20       85%              65%

Three patterns I'd reuse on any scraper

Preflight budget check. Estimate runtime BEFORE the run starts. If estimated > timeout, refuse to start (zero events charged) and tell the user exactly which knob to lower. Lots of users hate guessing whether their config will fit; once preflight shipped, my actor's timeout-rate dropped from 20% to ~0%.
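A rough sketch of the preflight (the per-tile and per-lead constants are illustrative averages, not the actor's exact numbers):

// Refuse to start if the configured run can't finish inside the timeout,
// and tell the user exactly which knob to lower. Throwing here happens
// before any billable event fires.
const SECS_PER_TILE = 90;  // illustrative: Maps scrolling per grid tile
const SECS_PER_LEAD = 6;   // illustrative: website crawl + validation per lead

function preflightCheck({ geoGridTiles, maxResults, timeoutSecs }) {
  const estimated = geoGridTiles ** 2 * SECS_PER_TILE + maxResults * SECS_PER_LEAD;
  if (estimated > timeoutSecs) {
    throw new Error(
      `Estimated runtime ~${estimated}s exceeds the ${timeoutSecs}s timeout. ` +
      'Lower maxResults or geoGridTiles, or raise the run timeout.'
    );
  }
}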

Pay-per-result instead of CU-based. Apify lets you bill per "event" (PAY_PER_EVENT). I switched to $0.005 per delivered lead + $0.00005 per run start. Failed/timed-out runs cost $0. Customers love the predictability — they can budget exactly.
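In code that's roughly the following (assuming a recent Apify SDK that exposes the pay-per-event Actor.charge() helper; the event names come from my own pricing config):

const { Actor } = require('apify');

// Bill only when a lead is actually delivered. If the run fails or times
// out before this point, nothing is charged. The small 'run-start' event
// is charged once, right after Actor.init().
async function deliverLead(lead) {
  await Actor.pushData(lead);
  await Actor.charge({ eventName: 'lead-delivered' });
}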

Delta mode. Pass the previous run's dataset ID; skip already-seen placeId AND cid BEFORE any billable event fires. Weekly recurring scrape costs the same as a one-off — you only pay for genuinely new businesses.

// Skip-before-bill check
if (knownPlaceIds.has(item.placeId) || knownCids.has(item.cid)) {
  continue; // no enrich, no email validation, no billing event
}
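Building those skip sets from the previous run takes only a few lines (assuming the standard Apify dataset client; previousDatasetId is just what I named the input field):

const { Actor } = require('apify');

// Load placeIds/cids from a previous run's dataset so the skip check
// above can run before any billable work starts.
async function loadSeenIds(previousDatasetId) {
  const knownPlaceIds = new Set();
  const knownCids = new Set();
  if (!previousDatasetId) return { knownPlaceIds, knownCids };

  const dataset = await Actor.openDataset(previousDatasetId);
  await dataset.forEach((item) => {
    if (item.placeId) knownPlaceIds.add(item.placeId);
    if (item.cid) knownCids.add(item.cid);
  });
  return { knownPlaceIds, knownCids };
}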

Try it / fork it

The actor is on Apify Store: Google Maps Email Extractor with Built-in Email Validation

The free tier gives you ~100 leads to test on your specific market. Drop a query like dentist Austin Texas or Zahnarzt Berlin Mitte and see what comes out.

Source code: the actor's source is closed on Apify, but the patterns above are MIT-licensed in this article — feel free to copy them into your own scraper. The biggest leverage is the multilingual contact-page list; the validation code is straightforward DNS plumbing.

If you've built something similar, or you have a market where my localization paths break, drop a comment. I'm tracking failure cases on the actor's Issues tab.

Top comments (1)

ByteHarvester

Curious — for anyone running cold-outreach pipelines, what's your DNS-only validation false-negative rate vs paid services like ZeroBounce/NeverBounce?

In my own testing, ~30% of emails graded "high deliverability" still bounced when sent, mostly because the receiver MX accepts but the inbox is dormant or the catch-all flag was misread.

Anyone getting better numbers with a different scoring scheme?