Nobody likes broken links. They hurt your SEO and frustrate users. You can catch them before your visitors (or Google) do.
In this post, I'll walk through three practical ways to check broken links with Python.
What Counts as a Broken Link?
Before writing code, let's define the target. A link is broken when the server responds with:
- 4xx errors: client errors like 404 Not Found or 410 Gone
- 5xx errors: server errors like 500 Internal Server Error or 503 Service Unavailable
- Connection failures: DNS resolution issues, timeouts, refused connections
Anything in the 2xx range is healthy, and 3xx redirects are usually fine.
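To make that definition concrete, here is a minimal sketch of the same rules as a function. The names classify_status and is_broken are my own, not part of any library, and I'm assuming a status of None stands for a connection failure (which is how the scripts in this post report it).

```python
from typing import Optional


def classify_status(status: Optional[int]) -> str:
    """Map an HTTP status code (or None for a connection failure) to a label."""
    if status is None:
        return "connection failure"
    if 200 <= status < 300:
        return "healthy"
    if 300 <= status < 400:
        return "redirect"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "unknown"


def is_broken(status: Optional[int]) -> bool:
    """A link counts as broken on 4xx, 5xx, or a connection failure."""
    return status is None or status >= 400


for code in (200, 301, 404, 410, 500, 503, None):
    print(code, classify_status(code), is_broken(code))
```

Keeping the rule in one place means you can later decide to treat, say, 403 or 429 differently without touching the request code.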
Method 1: Status Check with requests
This is the simplest approach. If you have a list of URLs and just want to know which ones are dead, requests does the job.
import requests

urls = [
    "https://example.com",
    "https://example.com/nonexistent-page",
    "https://httpstat.us/500",
    "https://httpstat.us/itwasworkingyesterday",
]

def check_link(url):
    """Return (url, status_code, is_ok); status_code is None on connection failure."""
    try:
        # HEAD is cheap: headers only, no body.
        response = requests.head(url, allow_redirects=True, timeout=10)
        if response.status_code >= 400:
            # Some servers reject HEAD, so confirm with GET before reporting.
            response = requests.get(url, allow_redirects=True, timeout=10)
        return url, response.status_code, response.status_code < 400
    except requests.RequestException:
        # DNS failure, timeout, refused connection, and similar.
        return url, None, False

for url in urls:
    link, status, is_ok = check_link(url)
    state = "OK" if is_ok else "BROKEN"
    print(f"[{state}] {link} -> {status}")
Why this works: We try a HEAD request first because it's lightweight. It asks for headers only, not the body. If HEAD comes back with an error status (some servers block or mishandle HEAD), we retry with GET before calling the link broken.
When to use it: Quick audits, validating a CSV of links, or a CI step that checks links in your docs.
Limitation: It checks one URL at a time and doesn't discover links on a page. For a big list, it's slow.
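For the CSV and CI use cases, the missing pieces are loading the URLs and turning the results into an exit code. Here's a rough sketch of both. I'm assuming the CSV has a header row with a column named url, and that results are the (url, status, is_ok) tuples that check_link returns; load_urls and exit_code are hypothetical helper names.

```python
import csv
import io


def load_urls(handle, column="url"):
    """Read URLs from an open CSV file object, skipping blanks and duplicates."""
    seen = set()
    urls = []
    for row in csv.DictReader(handle):
        url = (row.get(column) or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def exit_code(results):
    """Return 1 if any (url, status, is_ok) result is broken, else 0."""
    return 1 if any(not is_ok for _, _, is_ok in results) else 0


# In-memory sample so the sketch runs without a file on disk.
sample = io.StringIO(
    "url,notes\n"
    "https://example.com,home\n"
    "https://example.com,duplicate\n"
    "https://example.com/docs,docs\n"
)
print(load_urls(sample))
```

In a CI job you would open the real file instead of the sample, run check_link over the list, and pass exit_code(results) to sys.exit so a broken link fails the build.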
Method 2: Faster with Concurrency
Checking hundreds of links sequentially is painful. Network requests spend most of their time waiting, so concurrency gives you a massive speedup. Here's a version using concurrent.futures.
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = [
    "https://example.com",
    "https://example.com/broken",
    "https://httpstat.us/404",
    "https://httpstat.us/200",
]

def check_link(url):
    """Return (url, status_code, is_ok); status_code is None on connection failure."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        if response.status_code >= 400:
            # Retry with GET in case the server does not support HEAD.
            response = requests.get(url, allow_redirects=True, timeout=10)
        return url, response.status_code, response.status_code < 400
    except requests.RequestException:
        return url, None, False

broken = []

with ThreadPoolExecutor(max_workers=20) as executor:
    # Map each future back to its URL, then handle results as they finish.
    futures = {executor.submit(check_link, url): url for url in urls}
    for future in as_completed(futures):
        url, status, is_ok = future.result()
        if not is_ok:
            broken.append((url, status))
        print(f"[{'OK' if is_ok else 'BROKEN'}] {url} -> {status}")

print(f"\nFound {len(broken)} broken link(s).")
Why this works: ThreadPoolExecutor runs up to 20 requests in parallel. Since link checking is I/O-bound, threads are perfect here. No need for asyncio unless you're checking tens of thousands of URLs.
One tip: Hammering a single domain with 20 concurrent requests can get you blocked. Add a small delay or cap concurrency per host if you're scanning one site.
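Here's one way that per-host cap could look. This is a sketch, not a polished rate limiter: HostLimiter is a name I made up, it keeps one semaphore per hostname, and the optional delay is simply a sleep while the slot is still held. The check argument is any function shaped like check_link above.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


class HostLimiter:
    """Allow at most `per_host` simultaneous checks against any one hostname."""

    def __init__(self, per_host=2, delay=0.0):
        self._per_host = per_host
        self._delay = delay
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, url):
        host = urlparse(url).netloc
        # Guard the dict so two threads never create two semaphores for one host.
        with self._lock:
            if host not in self._semaphores:
                self._semaphores[host] = threading.BoundedSemaphore(self._per_host)
            return self._semaphores[host]

    def run(self, check, url):
        """Call check(url) while holding a slot for that URL's host."""
        with self._semaphore_for(url):
            try:
                return check(url)
            finally:
                if self._delay:
                    # Sleeping before releasing the slot spaces out requests per host.
                    time.sleep(self._delay)


# Plugging it into Method 2 (check_link and urls as defined there):
#   limiter = HostLimiter(per_host=2, delay=0.2)
#   with ThreadPoolExecutor(max_workers=20) as executor:
#       futures = [executor.submit(limiter.run, check_link, url) for url in urls]
```

One trade-off to be aware of: a worker waiting for a host slot still occupies a thread in the pool, so if almost every URL points at the same host, the extra workers mostly sit idle.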
Method 3: Crawl an Entire Page and Check All Links
The previous methods assume you already have a list of URLs. But usually you want to scan a live page, extract every link, and validate each one. For that, pair requests with BeautifulSoup.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_links(page_url):
    """Return the set of absolute http/https URLs linked from page_url."""
    response = requests.get(page_url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    links = set()
    for tag in soup.find_all("a", href=True):
        href = tag["href"]
        # Resolve relative hrefs against the page's own URL.
        full_url = urljoin(page_url, href)
        # Only keep http/https links
        if urlparse(full_url).scheme in ("http", "https"):
            links.add(full_url)
    return links

def check_link(url):
    """Return (url, status_code, is_ok); status_code is None on connection failure."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        if response.status_code >= 400:
            # Retry with GET in case the server does not support HEAD.
            response = requests.get(url, allow_redirects=True, timeout=10)
        return url, response.status_code, response.status_code < 400
    except requests.RequestException:
        return url, None, False

page = "https://example.com"
links = extract_links(page)
print(f"Found {len(links)} links on {page}\n")

with ThreadPoolExecutor(max_workers=20) as executor:
    futures = {executor.submit(check_link, link): link for link in links}
    for future in as_completed(futures):
        url, status, is_ok = future.result()
        if not is_ok:
            print(f"[BROKEN] {url} -> {status}")
Why this works: BeautifulSoup parses the HTML, urljoin turns relative paths (/about) into absolute URLs, and we reuse our concurrent checker to validate everything we found.
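If you want to see what urljoin actually does with different kinds of href, this small sketch runs a few through it. It also strips #fragments with urldefrag, which Method 3 doesn't do; that's my own addition, so that /page and /page#top aren't checked twice. normalize_link is a hypothetical helper name.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(page_url, href):
    """Resolve href against page_url; return None for non-HTTP(S) targets."""
    absolute = urljoin(page_url, href.strip())
    # Drop the #fragment: it points inside a page, not at a different resource.
    absolute, _fragment = urldefrag(absolute)
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


base = "https://example.com/blog/post"
for href in ["/about", "next", "#top", "mailto:hi@example.com", "https://other.example/x"]:
    print(f"{href!r:30} -> {normalize_link(base, href)}")
```

Note that a relative href like next resolves against the page's directory (/blog/), not the site root, which is why the base URL you pass in matters.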
Where it gets hard: This is a single page. To crawl an entire site, you'd need a queue, a visited-set to avoid loops, depth limits, robots.txt handling, and logic to stay on your own domain. That's a real project, not a snippet.
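To give a feel for the moving parts, here's a bare-bones sketch of the queue, visited set, depth limit, and same-domain check. It deliberately leaves out robots.txt handling, politeness delays, and error handling, so treat it as an outline rather than a crawler. The crawl function and its get_links parameter are my own names; get_links is any function that returns the absolute URLs found on a page, such as extract_links from Method 3.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, get_links, max_depth=2, max_pages=100):
    """Breadth-first crawl that stays on start_url's host.

    get_links(url) must return an iterable of absolute URLs found on that page.
    Returns the set of same-host pages that were visited.
    """
    host = urlparse(start_url).netloc
    visited = set()
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue  # already seen: this is what prevents loops
        visited.add(url)
        if depth >= max_depth:
            continue  # record the page, but do not follow its links
        for link in get_links(url):
            if urlparse(link).netloc == host and link not in visited:
                queue.append((link, depth + 1))
    return visited


# A fake four-page site with a loop, so the sketch runs without any network access.
fake_site = {
    "https://example.com/": ["https://example.com/a", "https://other.example/x"],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
}
pages = crawl("https://example.com/", lambda url: fake_site.get(url, []), max_depth=2)
print(sorted(pages))
```

With max_depth=2 the crawl reaches /a and /b but stops before /c, and the link back to the home page is ignored because it's already in the visited set.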
When Writing Code Isn't Worth It
The scripts above are great for small jobs and learning. But once you try to monitor a real website, the hidden costs pile up:
- CAPTCHAs and bot detection: many sites will block or challenge an automated scraper.
- Proxies and IP rotation: scan at scale and your IP gets throttled or banned.
- JavaScript-rendered pages: requests only sees raw HTML. Links injected by JavaScript require a headless browser, which is heavier and slower.
- Maintenance: sites change, edge cases appear, scheduling needs to run reliably, and someone has to keep the whole thing alive.
At that point you're no longer writing a script. You're maintaining infrastructure.
If you'd rather skip all that, the Broken Link Checker API handles things for you. You send a URL, it crawls the page, deals with proxies, CAPTCHAs, and rendering behind the scenes, and returns a list of broken links. Here's how simple the call is:
# pip install geekflare-api
from geekflare_api.client import GeekflareClient
from geekflare_api.models import BrokenLinkDto

with GeekflareClient(api_key="<api-key>") as client:
    result = client.broken_link(
        BrokenLinkDto(
            url="https://example.com"
        )
    )
    print(result)
It's the pragmatic choice when you want results, not a maintenance burden.
Which Method Should You Pick?
- Quick one-off check? Use Method 1.
- Checking a big list of URLs? Use Method 2 for the speed.
- Scanning links on a specific page? Use Method 3.
- Monitoring a real site continuously, or dealing with CAPTCHAs/proxies/JS? Use the Geekflare API and save yourself the headache.
Broken link checking is one of those tasks that looks trivial until it isn't. Start small with a script, and reach for a managed API the moment your needs outgrow it.