Found 800 URLs in my old blog export. Some were broken. Clicking manually was not happening.
Started simple: fetch each URL, check if it returns an error.
import requests

with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]  # skip blank lines

for url in urls:
    r = requests.get(url, timeout=10)
    if r.status_code >= 400:
        print(f'broken: {url}')
Ran it. Got rate limited after 20 requests. Added a delay.
import time

for url in urls:
    r = requests.get(url, timeout=10)
    if r.status_code >= 400:
        print(f'broken: {url}')
    time.sleep(1)
Better. Site stopped blocking me. Script ran for a few hours and found 47 broken links.
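One thing the loop above glosses over: requests.get raises an exception on timeouts, DNS failures, and connection resets, so a single dead host can kill the whole run partway through. A minimal sketch of a wrapper that treats those failures the same as an error status (the function names here are mine, not from the script):

```python
import requests

def fetch_status(url, timeout=10):
    """Return the HTTP status code, or None if the request itself failed."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.RequestException:
        return None  # timeouts, DNS failures, connection resets, etc.

def is_broken(status):
    """Treat outright failures and 4xx/5xx responses as broken."""
    return status is None or status >= 400
```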
Cool right? Done.
Except I went back to manually check a few of the "working" URLs and some were clearly dead. What.
Turns out the script was stopping at redirects. If URL A redirected to URL B, it only checked if A existed. URL B could be 404 and nobody would know.
So I rewrote it to follow the whole chain.
import requests
from urllib.parse import urljoin

def check_url(url, max_hops=10):
    current = url
    for _ in range(max_hops):
        r = requests.get(current, timeout=10, allow_redirects=False)
        if r.status_code >= 400:
            return False, current
        if not r.is_redirect:
            return True, None
        # Location may be a relative URL, so resolve it against the current one
        current = urljoin(current, r.headers['Location'])
    return False, current  # stuck in a redirect loop: also broken

for url in urls:
    works, failed_at = check_url(url)
    if not works:
        print(f'broken at {failed_at}: {url}')
    time.sleep(1)
This traces the full redirect chain. If anything in the chain returns an error, it reports which URL actually failed.
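One gotcha if you follow redirects by hand like this: the Location header is allowed to be a relative URL, and urllib.parse.urljoin resolves it against the URL you just fetched (the example URLs below are made up):

```python
from urllib.parse import urljoin

# An absolute Location replaces the current URL entirely
print(urljoin('http://example.com/blog/old', 'https://example.org/new'))
# → https://example.org/new

# A relative Location is resolved against the current URL
print(urljoin('http://example.com/blog/old-post', '/new-post'))
# → http://example.com/new-post
```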
Found about 20 more broken links that the first version missed. These were URLs that technically "worked" but only because they redirected somewhere that didn't exist.
So yeah. 67 broken links total instead of 47. Fun times.
The script takes forever now. Hours for 800 URLs. But it's accurate, and I'll have it for the next time I need to check links.
Manual clicking would've taken maybe 2 hours. This took a full weekend.
Not worth it. But I have a script that works now, so next time will be faster.
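If the runtime is the bottleneck next time, a thread pool is the usual fix. A sketch, assuming the target hosts tolerate a few parallel requests (which undercuts the politeness delay, so keep max_workers small):

```python
from concurrent.futures import ThreadPoolExecutor

def check_all(urls, check, max_workers=8):
    """Apply `check` to every URL on a small thread pool; keeps input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(check, urls)))
```

Here `check` would be something like the check_url function above; pool.map preserves input order, so results line up with the URL list.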