This tutorial is for those who want to build their own email scraper. At the end, I’ll show an improved version – a Streamlit app that anyone can use (and check the source code on GitHub).
Step 1: Get the Page Source Code
Before scraping emails, you need the raw HTML. You’ve got two ways to get it: either use a web scraping API or do it yourself with requests or headless browser tools like Selenium or Playwright.
Option A: Use a Scraping API
For sites that load content dynamically or block scrapers, APIs like HasData make life easier. They handle proxies, captchas, and JavaScript.
First, import the libraries, set variables (HasData API key, URL), and the request headers:
import requests
import json
api_key = "YOUR-API-KEY"
url= "https://example.com"
headers = {
'Content-Type': 'application/json',
'x-api-key': api_key
}
Then set up the request body with the target URL and extractEmails: true. This tells the API to return both the page content and a list of emails.
payload = json.dumps({
    "url": url,
    "proxyType": "datacenter",
    "proxyCountry": "US",
    "jsRendering": True,
    "extractEmails": True,
})
Now, make the request:
response = requests.post("https://api.hasdata.com/scrape/web", headers=headers, data=payload)
If you’re sticking with this method, jump to the next part – handling the API response.
Option B: Use Python’s requests Library
For static pages or quick tests, or if you just don’t want to mess with APIs, use requests to fetch the page:
import requests
import re
found_emails = set()
url = "https://example.com"
response = requests.get(url, timeout=10)
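One caveat: some sites reject the default python-requests User-Agent or hang forever. Here’s a small sketch of a more defensive fetch (the header value and the browser_headers name are my own example, not part of the snippet above):
browser_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # example UA string

try:
    # Send a browser-like User-Agent and fail fast on network problems
    response = requests.get(url, headers=browser_headers, timeout=10)
    response.raise_for_status()  # treat 4xx/5xx responses as errors
except requests.RequestException as e:
    print(f"[error] {url}: {e}")
    response = None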
Next, we’ll extract emails, but we’ll get to that soon.
Step 2: Extract Email Addresses
Once you have the HTML, there are a couple of ways to extract emails.
Option A: Using the API’s Built-in Email Extractor
After the HasData API request:
response = requests.post("https://api.hasdata.com/scrape/web", headers=headers, data=payload)
Parse the response and extract the email data:
data = response.json()
emails = data.get("emails", [])
Print the results:
print(url, " | ", emails)
It’s not much use to save emails from just one site, but we’ll handle saving after wrapping this in a loop.
Option B: Use Regex to Extract Emails from HTML
If you’re going the hard way, use regex to extract emails matching a pattern:
if response.status_code == 200:
    email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    emails = re.findall(email_pattern, response.text)
    for email in emails:
        found_emails.add((url, email))
else:
    print(f"[{response.status_code}] {url}")
I used a set here so duplicate emails scattered around the page are only stored once.
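Keep in mind the regex also matches strings that only look like emails, such as asset names like logo@2x.png. A quick cleanup pass (my own addition, not something the scraper strictly needs) filters out the most common offenders:
# Hypothetical cleanup: drop matches that end in a static-asset extension
BAD_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")

found_emails = {
    (url, email)
    for url, email in found_emails
    if not email.lower().endswith(BAD_SUFFIXES)
}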
Step 3: Loop Through Multiple URLs
To scrape more than one site, load a list of URLs from a file or a variable:
with open("urls.txt", "r", encoding="utf-8") as file:
urls = [line.strip() for line in file if line.strip()]
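The urls.txt file is just one URL per line, for example:
https://example.com
https://example.org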
Then loop through them, scraping each and collecting emails:
for url in urls:
Don’t forget a variable to store all found emails; declare it before the loop:
results = []
Inside the loop, append each URL together with its emails:
    results.append({
        "url": url,
        "emails": emails
    })
The code stays the same whether you use the API or scrape without it.
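Put together, the API variant of the loop looks roughly like this (a sketch with a trimmed payload; the full version with error handling is in the TL;DR below):
results = []

with open("urls.txt", "r", encoding="utf-8") as file:
    urls = [line.strip() for line in file if line.strip()]

for url in urls:
    # Ask HasData to fetch the page and pull out emails for us
    payload = json.dumps({"url": url, "extractEmails": True})
    response = requests.post("https://api.hasdata.com/scrape/web", headers=headers, data=payload)
    emails = response.json().get("emails", [])
    results.append({"url": url, "emails": emails})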
Step 4: Save the Results
Finally, save the collected emails. The easiest way is to write them to JSON:
with open("results.json", "w", encoding="utf-8") as json_file:
json.dump(results, json_file, ensure_ascii=False, indent=2)
Here’s how to save to CSV as well (this needs import csv at the top):
with open("results.csv", "w", newline="", encoding="utf-8") as csv_file:
writer = csv.writer(csv_file)
writer.writerow(["url", "email"])
for result in results:
for email in result["emails"]:
writer.writerow([result["url"], email])
Other formats aren’t worth the trouble.
TL;DR
If you got lost, skipped stuff, or just want the code, this part is for you.
Email Scraper with HasData’s API
Here’s the full scraper code:
import requests
import json
import csv
api_key = "YOUR-API-KEY"
headers = {
    'Content-Type': 'application/json',
    'x-api-key': api_key
}
results = []
with open("urls.txt", "r", encoding="utf-8") as file:
urls = [line.strip() for line in file if line.strip()]
for url in urls:
payload = json.dumps({
"url": url,
"proxyType": "datacenter",
"proxyCountry": "US",
"jsRendering": True,
"extractEmails": True,
})
    try:
        response = requests.post("https://api.hasdata.com/scrape/web", headers=headers, data=payload)
        response.raise_for_status()
        data = response.json()
        emails = data.get("emails", [])
        results.append({
            "url": url,
            "emails": emails
        })
    except Exception as e:
        print(f"[error] {url}: {e}")
        results.append({
            "url": url,
            "emails": []
        })
with open("results.json", "w", encoding="utf-8") as json_file:
json.dump(results, json_file, ensure_ascii=False, indent=2)
with open("results.csv", "w", newline="", encoding="utf-8") as csv_file:
writer = csv.writer(csv_file)
writer.writerow(["url", "email"])
for result in results:
for email in result["emails"]:
writer.writerow([result["url"], email])
I also added try..except blocks to catch errors, so one failed URL doesn’t stop the whole run.
Email Scraper with Regex
If you’re anti-API and into the hardcore way, here’s the code:
import requests
import re
import csv
found_emails = set()
output_file = "found_emails.csv"
file_path = "urls.txt"
with open(file_path, "r", encoding="utf-8") as file:
    websites = [line.strip() for line in file if line.strip()]
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}"
for website in websites:
    try:
        response = requests.get(website, timeout=10)
    except requests.RequestException as e:
        print(f"[error] {website}: {e}")
        continue
    if response.status_code == 200:
        emails = re.findall(email_pattern, response.text)
        for email in emails:
            found_emails.add((website, email))
    else:
        print(f"[{response.status_code}] {website}")
with open(output_file, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Website", "Email"])
    for website, email in found_emails:
        writer.writerow([website, email])

print(f"Saved {len(found_emails)} emails to {output_file}")
But be ready: this won’t work on every site. To improve it, swap requests for Selenium or another tool that mimics real user actions.
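Here’s a minimal sketch of that swap using Selenium (assuming Selenium 4.6+, which manages the Chrome driver itself); the regex step stays exactly the same:
import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")
html = driver.page_source  # HTML after JavaScript has run
driver.quit()

print(set(re.findall(email_pattern, html)))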
Bonus: Ready-to-Use Email Scraper Tool
Want to skip coding? Or just see how the tutorial code can be leveled up?
Try the free Email Scraper Tool built with Streamlit + HasData API. It works with Google Search, Maps, and raw URLs, and exports to CSV/JSON.
That’s all. Now go scrape what you need. Just don’t overdo it.