This tutorial is for those who want to build their own email scraper. At the end, I’ll show an improved version – a Streamlit app that anyone can use (and check the source code on GitHub).
Step 1: Get the Page Source Code
Before scraping emails, you need the raw HTML. You’ve got two ways to get it: either use a web scraping API or do it yourself with requests or headless browser tools like Selenium or Playwright.
Option A: Use a Scraping API
For sites that load content dynamically or block scrapers, APIs like HasData make life easier. They handle proxies, captchas, and JavaScript.
First, import the libraries, set variables (HasData API key, URL), and the request headers:
import requests
import json
api_key = "YOUR-API-KEY"
url= "https://example.com"
headers = {
'Content-Type': 'application/json',
'x-api-key': api_key
}
Then set up the request body with the target URL and extractEmails: true. This tells the API to return both the page content and a list of emails.
payload = json.dumps({
    "url": url,
    "proxyType": "datacenter",
    "proxyCountry": "US",
    "jsRendering": True,
    "extractEmails": True,
})
Now, make the request:
response = requests.post("https://api.hasdata.com/scrape/web", headers=headers, data=payload)
If you’re sticking with this method, jump to the next part – handling the API response.
Option B: Use Python’s requests Library
For static pages or quick tests, or if you just don’t want to mess with APIs, use requests to fetch the page:
import requests
import re
found_emails = set()
url = "https://example.com"
response = requests.get(url, timeout=10)
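One caveat: some sites reject the default python-requests User-Agent or hang forever. Here’s a small sketch of a more defensive fetch (the header value and the browser_headers name are my own example, not part of the snippet above):
browser_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # example UA string

try:
    # Send a browser-like User-Agent and fail fast on network problems
    response = requests.get(url, headers=browser_headers, timeout=10)
    response.raise_for_status()  # treat 4xx/5xx responses as errors
except requests.RequestException as e:
    print(f"[error] {url}: {e}")
    response = None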
Next, we’ll extract emails, but we’ll get to that soon.
Step 2: Extract Email Addresses
Once you have the HTML, there are a couple of ways to extract emails.
Option A: Using the API’s Built-in Email Extractor
After the HasData API request:
response = requests.post("https://api.hasdata.com/scrape/web", headers=headers, data=payload)
Parse the response and extract the email data:
data = response.json()
emails = data.get("emails", [])
Print the results:
print(url, " | ", emails)
It’s not much use to save emails from just one site, but we’ll handle saving after wrapping this in a loop.
Option B: Use Regex to Extract Emails from HTML
If you’re going the hard way, use regex to extract emails matching a pattern:
if response.status_code == 200:
    email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    emails = re.findall(email_pattern, response.text)
    for email in emails:
        found_emails.add((url, email))
else:
    print(f"[{response.status_code}] {url}")
I used a set here so duplicate emails scattered around the page are only stored once.
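Keep in mind the regex also matches strings that only look like emails, such as asset names like logo@2x.png. A quick cleanup pass (my own addition, not something the scraper strictly needs) filters out the most common offenders:
# Hypothetical cleanup: drop matches that end in a static-asset extension
BAD_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")

found_emails = {
    (url, email)
    for url, email in found_emails
    if not email.lower().endswith(BAD_SUFFIXES)
}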
Step 3: Loop Through Multiple URLs
To scrape more than one site, load a list of URLs from a file or a variable:
with open("urls.txt", "r", encoding="utf-8") as file:
urls = [line.strip() for line in file if line.strip()]
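The urls.txt file is just one URL per line, for example:
https://example.com
https://example.org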
Then loop through them, scraping each and collecting emails:
for url in urls:
Don’t forget a variable to store all found emails; declare it before the loop:
results = []
Inside the loop, append each URL together with its emails:
    results.append({
        "url": url,
        "emails": emails
    })
The code stays the same whether you use the API or scrape without it.
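Put together, the API variant of the loop looks roughly like this (a sketch with a trimmed payload; the full version with error handling is in the TL;DR below):
results = []

with open("urls.txt", "r", encoding="utf-8") as file:
    urls = [line.strip() for line in file if line.strip()]

for url in urls:
    # Ask HasData to fetch the page and pull out emails for us
    payload = json.dumps({"url": url, "extractEmails": True})
    response = requests.post("https://api.hasdata.com/scrape/web", headers=headers, data=payload)
    emails = response.json().get("emails", [])
    results.append({"url": url, "emails": emails})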
Step 4: Save the Results
Finally, save the collected emails. The easiest way is to write them to JSON:
with open("results.json", "w", encoding="utf-8") as json_file:
json.dump(results, json_file, ensure_ascii=False, indent=2)
Here’s how to save to CSV as well (this needs import csv at the top):
with open("results.csv", "w", newline="", encoding="utf-8") as csv_file:
writer = csv.writer(csv_file)
writer.writerow(["url", "email"])
for result in results:
for email in result["emails"]:
writer.writerow([result["url"], email])
Other formats aren’t worth the trouble.
TL;DR
If you got lost, skipped stuff, or just want the code, this part is for you.
Email Scraper with HasData’s API
Here’s the full scraper code:
import requests
import json
import csv
api_key = "YOUR-API-KEY"
headers = {
    'Content-Type': 'application/json',
    'x-api-key': api_key
}
results = []
with open("urls.txt", "r", encoding="utf-8") as file:
urls = [line.strip() for line in file if line.strip()]
for url in urls:
payload = json.dumps({
"url": url,
"proxyType": "datacenter",
"proxyCountry": "US",
"jsRendering": True,
"extractEmails": True,
})
    try:
        response = requests.post("https://api.hasdata.com/scrape/web", headers=headers, data=payload)
        response.raise_for_status()
        data = response.json()
        emails = data.get("emails", [])
        results.append({
            "url": url,
            "emails": emails
        })
    except Exception as e:
        print(f"[error] {url}: {e}")
        results.append({
            "url": url,
            "emails": []
        })
with open("results.json", "w", encoding="utf-8") as json_file:
json.dump(results, json_file, ensure_ascii=False, indent=2)
with open("results.csv", "w", newline="", encoding="utf-8") as csv_file:
writer = csv.writer(csv_file)
writer.writerow(["url", "email"])
for result in results:
for email in result["emails"]:
writer.writerow([result["url"], email])
I also added try..except blocks to catch errors, so one failed URL doesn’t stop the whole run.
Email Scraper with Regex
If you’re anti-API and into the hardcore way, here’s the code:
import requests
import re
import csv
found_emails = set()
output_file = "found_emails.csv"
file_path = "urls.txt"
with open(file_path, "r", encoding="utf-8") as file:
    websites = [line.strip() for line in file if line.strip()]
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}"
for website in websites:
    try:
        response = requests.get(website, timeout=10)
    except requests.RequestException as e:
        print(f"[error] {website}: {e}")
        continue
    if response.status_code == 200:
        emails = re.findall(email_pattern, response.text)
        for email in emails:
            found_emails.add((website, email))
    else:
        print(f"[{response.status_code}] {website}")
with open(output_file, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Website", "Email"])
    for website, email in found_emails:
        writer.writerow([website, email])

print(f"Saved {len(found_emails)} emails to {output_file}")
But be ready: this won’t work on every site. To improve it, swap requests for Selenium or another tool that mimics real user actions.
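Here’s a minimal sketch of that swap using Selenium (assuming Selenium 4.6+, which manages the Chrome driver itself); the regex step stays exactly the same:
import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")
html = driver.page_source  # HTML after JavaScript has run
driver.quit()

print(set(re.findall(email_pattern, html)))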
Bonus: Ready-to-Use Email Scraper Tool
Want to skip coding? Or just see how the tutorial code can be leveled up?
Try the free Email Scraper Tool built with Streamlit + HasData API. It works with Google Search, Maps, and raw URLs, and exports to CSV/JSON.
That’s all. Now go scrape what you need. Just don’t overdo it.