Chandan Kumar

Posted on Jun 9

How to Check Broken Links Using Python (3 Ways)

#python #webdev #api

Nobody likes broken links. They hurt your SEO and frustrate users. The good news is that you can catch them before your visitors (or Google) do.

In this post, I'll walk through three practical ways to check broken links with Python.

What Counts as a Broken Link?

Before writing code, let's define the target. A link is broken when the server responds with:

4xx errors — client errors like 404 Not Found or 410 Gone
5xx errors — server errors like 500 Internal Server Error or 503 Service Unavailable
Connection failures — DNS resolution issues, timeouts, refused connections

Anything in the 2xx range is healthy, and 3xx redirects are usually fine as long as the final destination responds with a 2xx.
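To make those categories concrete, here is a minimal sketch of a classifier. The function name `classify` and the label strings are my own for illustration, and `None` stands in for a connection failure where no status code came back.

```python
from typing import Optional


def classify(status: Optional[int]) -> str:
    """Map an HTTP status code (or None for a connection failure) to a label."""
    if status is None:
        return "broken: connection failure"
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    if 400 <= status < 500:
        return "broken: client error"
    if 500 <= status < 600:
        return "broken: server error"
    # 1xx and anything outside the standard ranges
    return "unknown"
```

The scripts below follow redirects, so in practice they only ever see the final status code of the chain.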

Method 1: Status Check with `requests`

This is the simplest approach. If you have a list of URLs and just want to know which ones are dead, requests does the job.

import requests

urls = [
    "https://example.com",
    "https://example.com/nonexistent-page",
    "https://httpstat.us/500",
    "https://httpstat.us/itwasworkingyesterday",
]

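def check_link(url):
    """Return (url, status_code, is_ok); status_code is None on connection failure."""
    try:
        # HEAD asks for headers only, so it is cheap
        response = requests.head(url, allow_redirects=True, timeout=10)
        if response.status_code >= 400:
            # Some servers reject or mishandle HEAD, so confirm with GET
            response = requests.get(url, allow_redirects=True, timeout=10)
        return url, response.status_code, response.status_code < 400
    except requests.RequestException:
        # DNS failures, timeouts, and refused connections all end up here
        return url, None, False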

for url in urls:
    link, status, is_ok = check_link(url)
    state = "OK" if is_ok else "BROKEN"
    print(f"[{state}] {link} -> {status}")

Why this works: We try a HEAD request first because it's lightweight. It asks for headers only, not the body. If HEAD comes back with an error status (some servers reject it or answer it incorrectly), we retry with GET before calling the link broken.

When to use it: Quick audits, validating a CSV of links, or a CI step that checks links in your docs.

Limitation: It checks one URL at a time and doesn't discover links on a page. For a big list, it's slow.
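For the CSV and CI use cases, a sketch like the one below could wrap the checker. It assumes the CSV has a header row with a column named `url` (that name is my assumption, as are the helper names `read_urls` and `exit_code`), and that results are the `(url, status, is_ok)` tuples returned by `check_link`.

```python
import csv
import io


def read_urls(csv_file, column="url"):
    """Return the non-empty values of `column` from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return [
        row[column].strip()
        for row in reader
        if (row.get(column) or "").strip()  # skip blank or missing cells
    ]


def exit_code(results):
    """Return 1 if any (url, status, is_ok) result is broken, else 0."""
    return 1 if any(not is_ok for _url, _status, is_ok in results) else 0
```

In a CI job you could open the file, run `check_link` over `read_urls(f)`, and pass the results to `sys.exit(exit_code(results))` so the build fails when a link is dead.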

Method 2: Faster with Concurrency

Checking hundreds of links sequentially is painful. Network requests spend most of their time waiting on the server, so running them concurrently gives you a large speedup. Here's a version using concurrent.futures.

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = [
    "https://example.com",
    "https://example.com/broken",
    "https://httpstat.us/404",
    "https://httpstat.us/200",
]

def check_link(url):
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        if response.status_code >= 400:
            response = requests.get(url, allow_redirects=True, timeout=10)
        return url, response.status_code, response.status_code < 400
    except requests.RequestException:
        return url, None, False

broken = []
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = {executor.submit(check_link, url): url for url in urls}
    for future in as_completed(futures):
        url, status, is_ok = future.result()
        if not is_ok:
            broken.append((url, status))
        print(f"[{'OK' if is_ok else 'BROKEN'}] {url} -> {status}")

print(f"\nFound {len(broken)} broken link(s).")

Why this works: ThreadPoolExecutor runs up to 20 requests at a time. Link checking is I/O-bound, so threads are a good fit: each one spends most of its time waiting on the network rather than running Python code. No need for asyncio unless you're checking tens of thousands of URLs.

One tip: Hammering a single domain with 20 concurrent requests can get you blocked. Add a small delay or cap concurrency per host if you're scanning one site.
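One way to cap concurrency per host is a semaphore per hostname. This is a rough sketch, not the only approach: `PerHostLimiter` and `check_all` are names I made up, and `check` is any function shaped like `check_link` above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


class PerHostLimiter:
    """Allow at most `max_per_host` simultaneous checks against one host."""

    def __init__(self, max_per_host=2):
        self._max = max_per_host
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, url):
        host = urlsplit(url).netloc.lower()
        with self._lock:  # guard the dict while creating semaphores
            if host not in self._semaphores:
                self._semaphores[host] = threading.Semaphore(self._max)
            return self._semaphores[host]

    def run(self, check, url):
        # Blocks until a slot for this URL's host is free
        with self._semaphore_for(url):
            return check(url)


def check_all(urls, check, max_workers=20, max_per_host=2):
    """Run `check` over `urls` concurrently; results keep the input order."""
    limiter = PerHostLimiter(max_per_host)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(lambda url: limiter.run(check, url), urls))
```

Calling `check_all(urls, check_link)` keeps the pool at 20 workers overall but never sends more than two requests to the same host at once. A thread waiting on a busy host still occupies a worker, so a list dominated by one domain will not run any faster than the per-host cap allows.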

Method 3: Crawl an Entire Page and Check All Links

The previous methods assume you already have a list of URLs. But usually you want to scan a live page, extract every link, and validate each one. For that, pair requests with BeautifulSoup.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from concurrent.futures import ThreadPoolExecutor, as_completed

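def extract_links(page_url):
    """Return the set of absolute http/https links found on page_url."""
    response = requests.get(page_url, timeout=10)
    # Fail loudly if the page itself is unreachable instead of parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    links = set()
    for tag in soup.find_all("a", href=True):
        href = tag["href"]
        # Resolve relative paths against the page URL
        full_url = urljoin(page_url, href)
        # Only keep http/https links (skips mailto:, tel:, javascript:, etc.)
        if urlparse(full_url).scheme in ("http", "https"):
            links.add(full_url)
    return links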

def check_link(url):
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        if response.status_code >= 400:
            response = requests.get(url, allow_redirects=True, timeout=10)
        return url, response.status_code, response.status_code < 400
    except requests.RequestException:
        return url, None, False

page = "https://example.com"
links = extract_links(page)
print(f"Found {len(links)} links on {page}\n")

with ThreadPoolExecutor(max_workers=20) as executor:
    futures = {executor.submit(check_link, link): link for link in links}
    for future in as_completed(futures):
        url, status, is_ok = future.result()
        if not is_ok:
            print(f"[BROKEN] {url} -> {status}")

Why this works: BeautifulSoup parses the HTML, urljoin turns relative paths (/about) into absolute URLs, and we reuse our concurrent checker to validate everything we found.
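If you want to see what urljoin does with different kinds of href values, here is a small standalone sketch. The `normalize` helper is hypothetical and goes one step further than the script above: it also strips `#fragment` anchors with urldefrag, so `/page` and `/page#comments` are not checked twice.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize(page_url, href):
    """Resolve href against page_url; return None for non-HTTP schemes."""
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


base = "https://example.com/blog/post"
for href in ["/about", "other", "#comments", "//cdn.example.com/x.js", "mailto:hi@example.com"]:
    print(f"{href!r:28} -> {normalize(base, href)}")
```

Root-relative paths resolve against the domain, plain relative paths resolve against the current directory, and protocol-relative URLs inherit the page's scheme.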

Where it gets hard: This script checks a single page. To crawl an entire site, you'd need a queue, a visited-set to avoid loops, depth limits, robots.txt handling, and logic to stay on your own domain. That's a real project, not a snippet.
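To give a feel for the moving parts, here is a bare-bones sketch of the queue, visited-set, depth limit, and same-domain check. It is deliberately incomplete: there is no robots.txt handling, no rate limiting, and no error handling. The `crawl` function and its `get_links` parameter are my own names; `get_links` is any callable that takes a page URL and returns absolute URLs, such as `extract_links` above.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, get_links, max_depth=2):
    """Breadth-first crawl that only follows links on start_url's host.

    Returns (visited_pages, found_links). found_links includes external
    links, which are collected for checking but never crawled.
    """
    home = urlparse(start_url).netloc
    visited = set()
    found = set()
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited:
            continue  # already crawled, avoids loops
        visited.add(url)
        for link in get_links(url):
            found.add(link)
            same_site = urlparse(link).netloc == home
            if same_site and depth < max_depth and link not in visited:
                queue.append((link, depth + 1))
    return visited, found
```

You would then feed `found` into the concurrent checker from Method 2. Even this small version shows why the full problem is bigger than a snippet.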

When Writing Code Isn't Worth It

The scripts above are great for small jobs and learning. But once you try to monitor a real website, the hidden costs pile up:

CAPTCHAs and bot detection — many sites will block or challenge an automated scraper.
Proxies and IP rotation — scan at scale and your IP gets throttled or banned.
JavaScript-rendered pages — requests only sees raw HTML. Links injected by JavaScript require a headless browser, which is heavier and slower.
Maintenance — sites change, edge cases appear, scheduling needs to run reliably, and someone has to keep the whole thing alive.

At that point you're no longer writing a script. You're maintaining infrastructure.

If you'd rather skip all that, the Broken Link Checker API handles things for you. You send a URL, it crawls the page, deals with proxies, CAPTCHAs, and rendering behind the scenes, and returns a list of broken links. Here's how simple the call is:

# pip install geekflare-api
from geekflare_api.client import GeekflareClient
from geekflare_api.models import BrokenLinkDto

with GeekflareClient(api_key="<api-key>") as client:
    result = client.broken_link(
        BrokenLinkDto(
            url="https://example.com"
        )
    )
    print(result)

It's the pragmatic choice when you want results, not a maintenance burden.

Which Method Should You Pick?

Quick one-off check? Use Method 1.
Checking a big list of URLs? Use Method 2 for the speed.
Scanning links on a specific page? Use Method 3.
Monitoring a real site continuously, or dealing with CAPTCHAs/proxies/JS? Use the Geekflare API and save yourself the headache.

Broken link checking is one of those tasks that looks trivial until it isn't. Start small with a script, and reach for a managed API the moment your needs outgrow it.
