🚀 Executive Summary
TL;DR: Broken links negatively impact user experience and SEO, making manual auditing impractical for growing websites. This solution provides an automated Python script to crawl a website, identify dead links, and send weekly health reports via email, ensuring proactive site maintenance.
🎯 Key Takeaways
- The core link-checking script uses Python’s requests library for HTTP requests and BeautifulSoup for parsing HTML to extract hyperlinks.
- Efficiency in link status checking is achieved by using HEAD requests instead of GET requests, which fetch only headers and status codes without downloading full page content.
- Sensitive information such as email credentials is handled securely through python-dotenv and a config.env file, with automation scheduled via cron for weekly execution.
Dead Link Checker: Automate Weekly Site Health Reports via Email
Introduction
Broken links are the silent saboteurs of the web. They degrade user experience, harm your SEO rankings, and create a perception of a neglected, unprofessional website. For any growing platform, manually clicking through every link is an impossible task. The result? “404 Not Found” errors accumulate, frustrating visitors and search engine crawlers alike.
This tutorial provides a robust, automated solution. We will build a Python script that crawls your website, identifies broken links, and sends a concise health report directly to your inbox every week. By investing a little time now, you can put your site’s link health on autopilot, allowing your team to focus on development and content creation instead of manual auditing.
Prerequisites
Before we begin, ensure you have the following ready:
- Python 3.x: The script is written in Python. Ensure it’s installed on the system where you plan to run the checker.
- Python Libraries: We’ll use a few essential libraries. You can install them all with a single command:
pip install requests beautifulsoup4 python-dotenv
- SMTP Server Access: An email account that can send emails via SMTP. For services like Gmail, you’ll need to generate an “App Password” if you have 2-Factor Authentication enabled. Alternatively, a transactional email service like SendGrid or Mailgun is an excellent choice. A quick way to verify your credentials is sketched after this list.
- A Server or CI/CD Environment: A place to schedule and run the script, such as a Linux server with cron, a Windows machine with Task Scheduler, or a CI/CD platform like GitHub Actions.
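Before wiring email into the script, it can save debugging time to verify your SMTP credentials in isolation. Here is a minimal sanity-check sketch, assuming Gmail’s server and port; the address and app password are placeholders to swap for your own:

# Minimal SMTP credential check (assumes Gmail; adjust the server and port for your provider)
import smtplib

with smtplib.SMTP("smtp.gmail.com", 587, timeout=10) as server:
    server.starttls()  # Upgrade the connection to TLS before authenticating
    server.login("your-email@gmail.com", "your-gmail-app-password")
    print("SMTP login succeeded.")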
Step-by-Step Guide
Step 1: Create the Core Link-Checking Script
First, let’s build the heart of our tool: a Python script that can crawl a website and identify non-working links. This script will use the requests library to make HTTP requests and BeautifulSoup to parse the HTML and extract all hyperlinks.
Create a file named link_checker.py and add the following code. This version focuses solely on finding the broken links.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def find_broken_links(site_url):
    """Crawls a website from a starting URL and reports broken links."""
    urls_to_visit = {site_url}
    visited_urls = set()
    broken_links = {}

    while urls_to_visit:
        url = urls_to_visit.pop()
        if url in visited_urls:
            continue

        print(f"Checking: {url}")
        visited_urls.add(url)

        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
        except requests.exceptions.RequestException as e:
            print(f" -> Failed to connect to {url}: {e}")
            broken_links[url] = str(e)
            continue

        soup = BeautifulSoup(response.text, 'html.parser')
        for link in soup.find_all('a', href=True):
            found_url = link['href']
            absolute_url = urljoin(url, found_url)

            # Clean the URL (remove fragment identifiers)
            parsed_url = urlparse(absolute_url)
            clean_url = parsed_url._replace(fragment='').geturl()

            # Skip non-HTTP(S) links such as mailto: or tel:
            if parsed_url.scheme not in ("http", "https"):
                continue

            if clean_url in visited_urls:
                continue

            # Queue the link for crawling only if it belongs to the same domain
            if urlparse(clean_url).netloc == urlparse(site_url).netloc:
                urls_to_visit.add(clean_url)

            # Now, check the status of the found link with a HEAD request for efficiency
            try:
                # Use a HEAD request to avoid downloading the whole page
                link_res = requests.head(clean_url, timeout=5, allow_redirects=True)
                if link_res.status_code >= 400:
                    print(f" -> BROKEN: {clean_url} (Status: {link_res.status_code}) found on {url}")
                    broken_links[clean_url] = f"HTTP {link_res.status_code}"
            except requests.exceptions.RequestException:
                # Mark as broken if the domain doesn't resolve or times out
                print(f" -> BROKEN: {clean_url} (Connection Error) found on {url}")
                broken_links[clean_url] = "Connection Error"

    return broken_links

if __name__ == "__main__":
    target_site = "https://your-website.com"  # Replace with your target URL
    found_broken_links = find_broken_links(target_site)
    if found_broken_links:
        print("\n--- Summary of Broken Links ---")
        for link, reason in found_broken_links.items():
            print(f"{link} - Reason: {reason}")
    else:
        print("\n--- No broken links found! ---")
Code Logic Explained:
- The script starts with a single URL and maintains a set of URLs to visit (urls_to_visit) and a set of URLs it has already checked (visited_urls) to avoid infinite loops.
- For each internal page, it uses BeautifulSoup to parse the HTML and find all <a> tags with an href attribute.
- urljoin correctly handles relative paths (e.g., /about-us) by converting them into absolute URLs, and comparing each link’s netloc against the starting URL’s netloc keeps the crawl on your own domain.
- To check whether a link is broken, we send a HEAD request. This is more efficient than a GET request because it fetches only the headers (including the status code) without downloading the entire page content.
- Any link that returns a status code of 400 or higher, or results in a connection error, is added to our broken_links dictionary. (An optional refinement using a shared session is sketched after this list.)
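As an optional refinement, the crawler can route all traffic through a single requests.Session. This is a hedged sketch rather than part of the core script: a session reuses TCP connections across requests, and a descriptive User-Agent header (the string below is just an example) makes the crawler easy for site operators to identify.

# Optional: a shared session for connection reuse and bot identification
import requests

session = requests.Session()
session.headers.update({"User-Agent": "SiteHealthBot/1.0 (link checker)"})

# Inside find_broken_links, session.get and session.head can then stand in
# for the module-level requests.get and requests.head calls:
response = session.get("https://your-website.com", timeout=10)
print(response.status_code)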
Step 2: Add Secure Email Reporting
Now, let’s add the functionality to email the results. It’s critical to handle credentials securely. We’ll use a config.env file to store our sensitive information and the python-dotenv library to load it into our script.
First, create a file named config.env in the same directory:
# -- Site Configuration --
TARGET_URL="https://your-website.com"
# -- SMTP Email Configuration --
SMTP_SERVER="smtp.gmail.com"
SMTP_PORT=587
SENDER_EMAIL="your-email@gmail.com"
SENDER_PASSWORD="your-gmail-app-password"
RECIPIENT_EMAIL="report-recipient@example.com"
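Because config.env holds a real password, keep it out of version control. If the project lives in a Git repository, one simple safeguard (assuming a standard .gitignore setup) is:

# Prevent the credentials file from ever being committed
echo "config.env" >> .gitignore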
Next, update link_checker.py to include the email sending logic.
import os
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from dotenv import load_dotenv

# ... (keep the find_broken_links function from Step 1) ...

def send_email_report(broken_links_report):
    """Sends an email with the scan results."""
    load_dotenv('config.env')

    sender_email = os.getenv("SENDER_EMAIL")
    sender_password = os.getenv("SENDER_PASSWORD")
    recipient_email = os.getenv("RECIPIENT_EMAIL")
    smtp_server = os.getenv("SMTP_SERVER")
    smtp_port = os.getenv("SMTP_PORT")

    # Validate before converting the port, so a missing value fails gracefully
    if not all([sender_email, sender_password, recipient_email, smtp_server, smtp_port]):
        print("Email configuration is missing from config.env. Skipping email report.")
        return
    smtp_port = int(smtp_port)

    # Create the email message
    message = MIMEMultipart("alternative")
    message["Subject"] = "Weekly Website Link Health Report"
    message["From"] = sender_email
    message["To"] = recipient_email

    # Format the report body
    if not broken_links_report:
        body = "Great news! No broken links were found on the website this week."
    else:
        body = "The following broken links were detected during the weekly scan:\n\n"
        for link, reason in broken_links_report.items():
            body += f"- {link} (Reason: {reason})\n"

    message.attach(MIMEText(body, "plain"))

    # Send the email
    try:
        with smtplib.SMTP(smtp_server, smtp_port) as server:
            server.starttls()  # Secure the connection
            server.login(sender_email, sender_password)
            server.sendmail(sender_email, recipient_email, message.as_string())
        print("Email report sent successfully!")
    except Exception as e:
        print(f"Failed to send email report: {e}")

if __name__ == "__main__":
    load_dotenv('config.env')
    target_site = os.getenv("TARGET_URL")
    if not target_site:
        print("TARGET_URL not set in config.env. Exiting.")
    else:
        found_broken_links = find_broken_links(target_site)
        send_email_report(found_broken_links)
        if found_broken_links:
            print("\n--- Summary of Broken Links ---")
            for link, reason in found_broken_links.items():
                print(f"{link} - Reason: {reason}")
        else:
            print("\n--- No broken links found! ---")
Code Logic Explained:
- load_dotenv('config.env') securely loads the credentials from your configuration file into the environment.
- The send_email_report function formats the list of broken links into a clean, readable message.
- It uses Python’s built-in smtplib to connect to your email provider, authenticate with your credentials, and send the report.
- The main execution block is updated to call the crawler and the email function in sequence. (An HTML variant of the report is sketched after this list.)
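Because the message is created as MIMEMultipart("alternative"), it can carry an HTML part alongside the plain-text one, and mail clients render the richest part they support (the last attached part is preferred). Here is a self-contained hedged sketch of that extension; the markup is purely illustrative:

# Hedged sketch: a multipart/alternative report with both plain-text and HTML parts
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

broken_links_report = {"https://your-website.com/old-page": "HTTP 404"}  # example data

message = MIMEMultipart("alternative")
message["Subject"] = "Weekly Website Link Health Report"

plain_body = "\n".join(f"- {link} (Reason: {reason})"
                       for link, reason in broken_links_report.items())
html_items = "".join(f"<li><a href='{link}'>{link}</a>: {reason}</li>"
                     for link, reason in broken_links_report.items())
html_body = f"<h2>Weekly Link Health Report</h2><ul>{html_items}</ul>"

# Attach plain text first, then HTML; clients prefer the last part they can render.
message.attach(MIMEText(plain_body, "plain"))
message.attach(MIMEText(html_body, "html"))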
Step 3: Schedule the Script with Cron
Automation is the final piece of the puzzle. We’ll use cron, a standard time-based job scheduler on Unix-like operating systems, to run our script automatically once a week.
First, make sure the python3 interpreter on the server can run the script and that its dependencies are installed. Then open your crontab for editing:

# Open your cron scheduler for editing
crontab -e
Add the following line to schedule the script to run every Monday at 2:00 AM. Adjust the path to where you saved your script.
# Run the link checker every Monday at 2 AM
0 2 * * 1 cd /path/to/your/project && python3 link_checker.py
Cron Syntax Explained:
- 0 2 * * 1: This means “at minute 0, of hour 2, on any day of the month, in any month, on the 1st day of the week (Monday)”.
- cd /path/to/your/project: This is a crucial step. It changes into the directory where your script and config.env file live, so the script can find its configuration file.
- && python3 link_checker.py: The && ensures the script only runs if the cd command succeeded. It executes your script using the Python 3 interpreter. (A variant that keeps a log of each run is shown after this list.)
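Cron jobs run without a terminal, so their output vanishes unless you capture it. One common pattern is to append both stdout and stderr to a log file; the log path below is just an example:

# Same schedule, but keep a log of each run for debugging
0 2 * * 1 cd /path/to/your/project && python3 link_checker.py >> link_checker.log 2>&1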
Common Pitfalls
Even with a solid setup, you might encounter a few issues:
- Aggressive Crawling and IP Blocking: If your script makes too many requests in a short period, a Web Application Firewall (WAF) or rate-limiting service might temporarily block its IP address. To mitigate this, consider adding a small delay inside your crawling loop (e.g., import time; time.sleep(0.5)) between requests to appear more like a human user; a placement sketch follows this list.
- Email Authentication Issues: SMTP servers can be picky. If emails fail to send, double-check your credentials. If using Gmail, ensure you have generated and are using a dedicated “App Password” rather than your main account password, especially if 2FA is active. Also, confirm the SMTP server and port are correct for your provider.
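To make the first mitigation concrete, here is roughly where the delay would sit in find_broken_links; this self-contained sketch uses a stand-in for the real fetch, and the 0.5-second value is the example from above, not a magic number:

import time

urls_to_visit = {"https://your-website.com"}  # example seed, as in Step 1
visited_urls = set()

while urls_to_visit:
    url = urls_to_visit.pop()
    if url in visited_urls:
        continue
    visited_urls.add(url)
    time.sleep(0.5)  # Pause between page fetches to stay under rate limits
    print(f"Would fetch: {url}")  # stand-in for the requests.get call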
Conclusion
You have now built a powerful, automated system for maintaining your website’s link integrity. This script works tirelessly in the background, giving you weekly peace of mind and freeing up valuable engineering time. By catching broken links proactively, you not only improve your site’s SEO performance and user trust but also establish a foundation for scalable, high-quality web maintenance.
From here, you can expand the script to check for broken images, integrate its output with a monitoring dashboard, or add more sophisticated error handling and retry logic. Welcome to the world of proactive site health automation!
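As a starting point for one of those extensions, checking for broken images mostly means swapping the tag and attribute the parser looks for. A minimal hedged sketch, reusing the placeholder URL from this guide:

# Hedged sketch: extract image URLs from one page and HEAD-check them
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = "https://your-website.com"  # placeholder, as elsewhere in this guide
response = requests.get(page_url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

for img in soup.find_all('img', src=True):
    img_url = urljoin(page_url, img['src'])
    status = requests.head(img_url, timeout=5, allow_redirects=True).status_code
    if status >= 400:
        print(f"BROKEN IMAGE: {img_url} (Status: {status})")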
👉 Read the original article on TechResolve.blog
☕ Support my work
If this article helped you, you can buy me a coffee:
