i built a scraping pipeline that extracts verified email addresses from marketing agency websites across 54 countries. here's the exact process, the code patterns, and what i learned scraping at scale.
why agency emails?
i sell cold outreach services to marketing agencies. to pitch them, i need their email addresses. buying lists is expensive and often outdated. scraping them myself means fresher data and zero cost.
the tools
```
beautifulsoup4   # HTML parsing
requests         # HTTP requests
re               # email regex extraction (stdlib)
json             # batch file management (stdlib)
smtplib          # sending (later, stdlib)
```
step 1: find agencies via web search
for each target city, i search for "digital marketing agency {city} {country} email contact". the first 2-3 pages of results usually contain 8-15 agency websites.
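the query-building part of that step is trivial to script. a minimal sketch, assuming targets come in as (city, country) pairs — the function names are mine, not from the pipeline:

```python
def build_query(city, country):
    """Build the search string used to find agencies in one city."""
    return f'digital marketing agency {city} {country} email contact'

def build_queries(targets):
    """Expand a list of (city, country) pairs into search queries."""
    return [build_query(city, country) for city, country in targets]
```

feeding each query to a search API (or a results scraper) and collecting the first 2-3 pages of URLs gives the agency list for step 2.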
step 2: extract emails from websites
for each agency URL, i scrape their contact page, about page, and homepage. the email extraction regex:
```python
import re

def extract_emails(html_text):
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    raw = re.findall(pattern, html_text)
    # filter out image filenames and spam traps
    filtered = [e for e in raw if not any(
        ext in e.lower() for ext in
        ['.png', '.jpg', '.avif', '.svg', 'sentry', 'schema', 'cloudflare']
    )]
    return list(set(filtered))
```
the key insight: always check /contact, /contact-us, and /about pages. many agencies hide their email on the contact page only.
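step 2 end-to-end looks roughly like this. the candidate page paths follow the advice above; the helper names and timeout are my assumptions, and extract_emails is repeated so the snippet runs standalone:

```python
import re
import requests

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
JUNK = ('.png', '.jpg', '.avif', '.svg', 'sentry', 'schema', 'cloudflare')

def extract_emails(html_text):
    """Same regex + junk filter as above, inlined for a runnable sketch."""
    raw = EMAIL_RE.findall(html_text)
    return list({e for e in raw if not any(j in e.lower() for j in JUNK)})

def candidate_pages(base_url):
    """Homepage plus the pages most likely to list an email."""
    base = base_url.rstrip('/')
    return [base, base + '/contact', base + '/contact-us', base + '/about']

def scrape_agency(base_url, timeout=10):
    """Collect every email found across an agency's key pages."""
    found = set()
    for url in candidate_pages(base_url):
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.ok:
                found.update(extract_emails(resp.text))
        except requests.RequestException:
            continue  # dead path: try the next candidate page
    return sorted(found)
```

agencies with a contact form but no visible address will come back empty here — per the pitfalls below, skip those and move on.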
step 3: scan their site for personalization data
before emailing, i run my SEO analyzer against their domain. this gives me specific issues to reference in the pitch — missing alt text, no meta descriptions, slow page load. real problems, not generic "your SEO could be better."
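the analyzer itself isn't shown in the post, but the checks it mentions (missing alt text, no meta descriptions) can be sketched with beautifulsoup — treat this as an illustration, not the actual tool:

```python
from bs4 import BeautifulSoup

def quick_seo_issues(html_text):
    """Return a list of concrete, citable problems for the pitch."""
    soup = BeautifulSoup(html_text, 'html.parser')
    issues = []
    if not soup.find('meta', attrs={'name': 'description'}):
        issues.append('no meta description')
    missing_alt = [img for img in soup.find_all('img') if not img.get('alt')]
    if missing_alt:
        issues.append(f'{len(missing_alt)} image(s) missing alt text')
    if not soup.find('title'):
        issues.append('missing <title> tag')
    return issues
```

each string that comes back is a specific, verifiable claim you can drop into the email body.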
step 4: generate personalized emails
each email template includes:
- their domain name in the subject line
- specific SEO findings from the scan
- a clear value proposition
- link to a free sample or landing page
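wiring those four elements together is plain string templating. a sketch, assuming scan results arrive as a list of issue strings — the wording is illustrative, not the actual copy:

```python
TEMPLATE = """Subject: quick SEO wins for {domain}

hi -- i ran a quick scan of {domain} and found a few fixable issues:
{issues}

we fix exactly these problems for agencies. free sample here: {link}
"""

def render_email(domain, issues, link):
    """Fill the template: domain in the subject, findings as bullets."""
    bullet_list = '\n'.join(f'- {i}' for i in issues)
    return TEMPLATE.format(domain=domain, issues=bullet_list, link=link)
```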
the results
- 798 verified agency emails across 54 countries
- additional 222 dentist practices and 554 multi-niche prospects
- total pipeline: 1,204 emails ready to send
common scraping pitfalls
- image filenames match email regex — always filter .png, .jpg, .avif
- contact forms without visible emails — skip these, move to next agency
- cloudflare/sentry emails in page source — filter by domain
- rate limiting — add 2-3 second delays between requests
- broken SSL certificates — use `verify=False` with caution
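the last two pitfalls fold naturally into one fetch helper: a randomized 2-3 second delay between requests, and an explicit fallback to `verify=False` only when the certificate is broken. helper names are my own:

```python
import random
import time
import requests

def next_delay(low=2.0, high=3.0):
    """Randomized pause so request timing doesn't look robotic."""
    return random.uniform(low, high)

def polite_get(url, timeout=10):
    """GET with a delay; retry unverified only on a broken certificate."""
    time.sleep(next_delay())
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.SSLError:
        # broken cert: last-resort unverified retry, use with caution
        return requests.get(url, timeout=timeout, verify=False)
```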
the data product
i packaged the agency contacts into a downloadable CSV:
- free 50-agency sample — verify the data quality yourself
- SEO chrome extension — scan any site for SEO issues ($9)
- full outreach service — managed cold email campaigns
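the CSV packaging itself is a few lines with the stdlib csv module. the column names here are assumptions, not the product's actual schema:

```python
import csv

FIELDS = ['agency', 'domain', 'email', 'country']

def write_contacts(path, contacts):
    """Write a list of dicts (keyed by FIELDS) as a downloadable CSV."""
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(contacts)
```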
the scraping code runs on a basic linux server with cron jobs. total infrastructure cost: $0/month (using existing server).
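a cron setup like that is a single crontab line. the schedule and script path below are placeholders, not the actual layout:

```shell
# illustrative crontab entry: run the scraper nightly at 02:15
15 2 * * * /usr/bin/python3 /opt/scraper/run_pipeline.py >> /var/log/scraper.log 2>&1
```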
if you're building outreach tools or scraping at scale, the hardest part isn't the code — it's maintaining data quality as you scale past hundreds of entries.