Mate Technologies

Build Your Own Email Scraper with Python – Step by Step 🐍

Ever wanted to find public email addresses online automatically? In this tutorial, we’ll build EmailScout – Public Contact Finder, a Python tool that searches Google, scrapes pages, and exports results. We’ll break it down so even beginners can follow!

GitHub repo for this project: https://github.com/rogers-cyber/python-tiny-tools/tree/main/EmailScout_Public_Contact_Finder

Step 1: Setup & Install Dependencies

We’ll use a few libraries:

tkinter for the GUI (ships with Python)

ttkbootstrap for modern styling

requests for HTTP requests

BeautifulSoup (from the beautifulsoup4 package) for parsing HTML

re for regex email matching (also in the standard library)

Install the extra packages with pip:

pip install ttkbootstrap beautifulsoup4 requests
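
If you want to confirm everything installed correctly, you can check the imports from the command line (just a sanity check, not part of the project):

python -c "import ttkbootstrap, bs4, requests; print('All set')"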

Step 2: Import Libraries

Start your script by importing everything you’ll need:

import tkinter as tk
from tkinter import messagebox, filedialog
import ttkbootstrap as tb
from ttkbootstrap.scrolled import ScrolledText
import threading
import time
import json
import csv
import requests
import re
import os
import sys
from collections import defaultdict
from bs4 import BeautifulSoup

Explanation:
These imports give us GUI tools (tkinter and ttkbootstrap), threading for running tasks in the background, requests and BeautifulSoup for fetching and parsing pages, and csv/json for exporting results.

Step 3: Setup Basic Variables & Regex

We need a regex pattern to detect emails and a place to store results:

HEADERS = {"User-Agent": "Mozilla/5.0"}
SEARCH_URL = "https://www.google.com/search"

emails_found = set()
sources = defaultdict(list)
stop_event = threading.Event()
scrape_completed = False

# Regex pattern for emails
EMAIL_REGEX = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

Explanation:

emails_found stores unique emails.

sources keeps track of where each email was found.

EMAIL_REGEX matches common email formats (quick demo below).

stop_event allows us to stop the scraper mid-run.
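
To see the pattern in action, here’s a quick standalone check (the sample text is made up):

print(EMAIL_REGEX.findall("Contact alice@example.com or bob.smith@mail.co.uk today."))
# ['alice@example.com', 'bob.smith@mail.co.uk']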

Step 4: Create the GUI Window

We use ttkbootstrap to make a styled window:

app = tb.Window("EmailScout – Public Contact Finder", themename="superhero", size=(1300, 680))
app.grid_columnconfigure(0, weight=1)
app.grid_rowconfigure(1, weight=1)

Explanation:

This sets up a resizable window with a “superhero” theme.

grid_columnconfigure and grid_rowconfigure make the layout flexible.

Step 5: Input Section – Enter Keywords

Users need to input search queries:

input_card = tb.Labelframe(app, text="Search Keywords", padding=15)
input_card.grid(row=0, column=0, sticky="nsew", padx=10, pady=10)

tb.Label(input_card, text="One search per line (e.g. 'AI developer contact email')").pack(anchor="w")
keywords_input = ScrolledText(input_card, height=7)
keywords_input.pack(fill="both", expand=True)

Explanation:

A Labelframe organizes the input area.

ScrolledText allows multi-line input with scrollbars.

Step 6: Output Section – Live Results

We want users to see emails as they are found:

output_card = tb.Labelframe(app, text="Live Results", padding=15)
output_card.grid(row=1, column=0, sticky="nsew", padx=10, pady=10)

log = ScrolledText(output_card)
log.pack(fill="both", expand=True)
log.text.config(state="disabled")

Explanation:

This is a read-only scrollable text area.

We’ll append new emails to this as they are scraped.

Step 7: Footer Buttons

Add buttons for Start, Stop, and Export:

footer = tb.Frame(app)
footer.grid(row=2, column=0, sticky="ew", padx=10, pady=5)

start_btn = tb.Button(footer, text="Start", bootstyle="success", width=18)
start_btn.pack(side="left", padx=5)

stop_btn = tb.Button(footer, text="Stop", bootstyle="danger", width=15)
stop_btn.pack(side="left", padx=5)
stop_btn.config(state="disabled")

export_txt = tb.Button(footer, text="Export TXT", width=15)
export_txt.pack(side="left", padx=5)

export_csv = tb.Button(footer, text="Export CSV", width=15)
export_csv.pack(side="left", padx=5)

export_json = tb.Button(footer, text="Export JSON", width=15)
export_json.pack(side="left", padx=5)

Explanation:

Start begins scraping.

Stop interrupts a run in progress.

The Export buttons save results as TXT, CSV, or JSON.

(None of the buttons do anything yet; we’ll attach their callbacks in Step 12, once the functions exist.)

Step 8: Logging Helper

We need a simple function to log emails in real-time:

def log_line(t):
    log.text.config(state="normal")
    log.text.insert("end", t + "\n")
    log.text.see("end")
    log.text.config(state="disabled")
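
One caveat: Tkinter widgets aren’t guaranteed to be thread-safe, and Step 10 will call log_line from a background thread. If you run into crashes, a more defensive variant (my addition, not part of the original project) schedules the update on the main loop with app.after:

def log_line_safe(t):
    # app.after(0, ...) runs the callback on Tkinter's main loop,
    # so the widget is never touched from the worker thread
    app.after(0, lambda: log_line(t))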

Step 9: Google Search & Scraper Functions

Here’s the core scraping logic:

def google_search(query):
    params = {"q": query, "num": 5}
    r = requests.get(SEARCH_URL, params=params, headers=HEADERS, timeout=10)
    soup = BeautifulSoup(r.text, "html.parser")
    return [a["href"] for a in soup.select("a") if a.get("href", "").startswith("http")]

def scrape_page(url):
    try:
        r = requests.get(url, headers=HEADERS, timeout=10)
        return set(EMAIL_REGEX.findall(r.text))
    except requests.RequestException:
        return set()

Explanation:

google_search finds links from Google.

scrape_page downloads the page and finds emails using regex.
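
You can try these two functions on their own before wiring up the GUI. Keep in mind that Google may serve a CAPTCHA page to automated clients, in which case google_search returns few or no useful links (the query here is only an example):

urls = google_search("open source maintainer contact email")
print(f"Found {len(urls)} links")
if urls:
    print(scrape_page(urls[0]))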

Step 10: Running the Scraper in Threads

We don’t want the GUI to freeze, so we use threads:

def run_scraper(queries):
    global scrape_completed

    for q in queries:
        if stop_event.is_set(): return
        log_line(f"🔍 Searching: {q}")

        urls = google_search(q)
        for url in urls:
            if stop_event.is_set(): return

            emails = scrape_page(url)
            for e in emails:
                if e not in emails_found:
                    emails_found.add(e)
                    sources[e].append(url)
                    log_line(e)

            time.sleep(0.6)

    scrape_completed = True
    messagebox.showinfo("Done", f"Found {len(emails_found)} public emails.")

Explanation:

Loops through the queries and the URLs each one returns.

Adds new emails to the set, records the source URL, and logs them.

time.sleep(0.6) adds a short pause between requests so we don’t hammer servers.

This runs on a worker thread but updates the GUI directly, so the thread-safety caveat from Step 8 applies here too.

Step 11: Start & Stop Buttons

Connect the buttons to actions:

def start_scraping():
    global scrape_completed
    scrape_completed = False
    stop_event.clear()
    emails_found.clear()
    sources.clear()

    queries = [q.strip() for q in keywords_input.get("1.0", "end").splitlines() if q.strip()]
    if not queries:
        messagebox.showerror("Input Error", "Please enter at least one search query.")
        return

    log.text.config(state="normal")
    log.text.delete("1.0", "end")
    log.text.config(state="disabled")

    stop_btn.config(state="normal")
    start_btn.config(state="disabled")

    threading.Thread(target=run_scraper, args=(queries,), daemon=True).start()

def stop_scraping():
    stop_event.set()
    log_line("⛔ Stopped by user")
    stop_btn.config(state="disabled")
    start_btn.config(state="normal")
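
One small gap: when run_scraper finishes on its own, the Start button stays disabled. A thin wrapper (my addition; the full repo may handle this differently) restores the buttons, and start_scraping would launch it instead of run_scraper:

def run_scraper_and_reset(queries):
    try:
        run_scraper(queries)
    finally:
        # Restore the buttons whether the run finished or was stopped
        stop_btn.config(state="disabled")
        start_btn.config(state="normal")

# In start_scraping, swap the thread target:
# threading.Thread(target=run_scraper_and_reset, args=(queries,), daemon=True).start()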

Step 12: Exporting Results

Allow users to save emails:

def export_file(fmt):
    if not emails_found or not scrape_completed:
        messagebox.showerror("Export Error", "Nothing to export.")
        return

    path = filedialog.asksaveasfilename(defaultextension=f".{fmt}")
    if not path: return

    if fmt == "txt":
        with open(path, "w") as f:
            for e in sorted(emails_found):
                f.write(e + "\n")

    elif fmt == "csv":
        with open(path, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(["email", "source"])
            for e, s in sources.items():
                w.writerow([e, ", ".join(s)])

    elif fmt == "json":
        with open(path, "w") as f:
            json.dump(sources, f, indent=2)

    messagebox.showinfo("Exported", "File saved successfully.")
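
The callbacks above are defined but never attached to the Step 7 buttons. Before starting the app, wire them up (a minimal sketch; the full project may pass command= when creating the buttons instead):

start_btn.config(command=start_scraping)
stop_btn.config(command=stop_scraping)
export_txt.config(command=lambda: export_file("txt"))
export_csv.config(command=lambda: export_file("csv"))
export_json.config(command=lambda: export_file("json"))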

Step 13: Run the App

Finally, start the Tkinter main loop:

app.mainloop()

🎉 Congratulations!

You’ve built a fully functional public email finder with Python.
You can now:

Enter search queries

Scrape public emails from websites

Export results in TXT, CSV, or JSON

GitHub repo link for the full project:
https://github.com/rogers-cyber/python-tiny-tools/tree/main/EmailScout_Public_Contact_Finder

#Python #WebScraping #Automation #Tkinter #BeginnerPython #EmailScraper #DevTutorial
