Nico Reyes

I wasted a weekend building a link checker. Manual clicking would've been faster.

Found 800 URLs in my old blog export. Some were broken. Clicking manually was not happening.

Started simple: hit each URL, check if it comes back with an error status.

import requests

with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    # HEAD gets the status code without downloading the whole page
    r = requests.head(url, timeout=10)
    if r.status_code >= 400:
        print(f'broken: {url}')

Ran it. Got rate limited after 20 requests. Added a delay.

import time

for url in urls:
    r = requests.head(url, timeout=10)
    if r.status_code >= 400:
        print(f'broken: {url}')
    time.sleep(1)  # one request a second keeps the rate limiter off my back

Better. Site stopped blocking me. Script ran for a few hours and found 47 broken links.
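In hindsight, a flat one-second sleep is a blunt instrument. If I did this again I'd honor the Retry-After header that servers send with a 429 response instead. Something like this, untested, and polite_head is just a name I made up:

import time
import requests

def polite_head(url):
    # hypothetical helper: back off for as long as the server asks on a 429
    while True:
        r = requests.head(url, timeout=10)
        if r.status_code != 429:
            return r
        retry_after = r.headers.get('Retry-After', '5')
        # Retry-After can also be an HTTP date; just fall back to 5s for that
        time.sleep(int(retry_after) if retry_after.isdigit() else 5)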

Cool right? Done.

Except I went back to manually check a few of the "working" URLs and some were clearly dead. What.

Turns out the script was stopping at redirects. A 301 or 302 status code is under 400, so if URL A redirected to URL B, the script counted A as working and never looked at B. URL B could be a 404 and nobody would know.
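You can see it in miniature with httpbin (the redirect target here is made up):

import requests

# httpbin answers with a 302 pointing wherever you ask;
# the target URL is hypothetical and doesn't exist
r = requests.head('https://httpbin.org/redirect-to?url=https://example.com/gone', timeout=10)
print(r.status_code)  # 302 -- under 400, so the old script called this "working"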

So I rewrote it to follow the whole chain.

import time
import requests
from urllib.parse import urljoin

def check_url(url, max_hops=10):
    current = url
    for _ in range(max_hops):  # give up on redirect loops
        try:
            r = requests.get(current, timeout=10, allow_redirects=False)
        except requests.RequestException:
            # dead host, timeout, refused connection: all count as broken
            return False, current
        if r.status_code >= 400:
            return False, current
        if not r.is_redirect:
            return True, None
        # Location can be relative, so resolve it against the current URL
        current = urljoin(current, r.headers['Location'])
        time.sleep(1)  # still being polite between hops
    return False, current  # too many hops, treat it as broken

for url in urls:
    works, failed_at = check_url(url)
    if not works:
        print(f'broken at {failed_at}: {url}')
    time.sleep(1)

This traces the full redirect chain hop by hop. If anything in the chain errors out or stops responding entirely, it reports which URL actually failed.

Found about 20 more broken links that the first version missed. These were URLs that technically "worked" but only because they redirected somewhere that didn't exist.

So yeah. 67 broken links total instead of 47. Fun times.

The script takes forever now. Hours for 800 URLs. But it's accurate, and I'll have it for the next time I need to check links.
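If the speed ever actually bothers me, the obvious move is a small thread pool, since link checking is almost entirely waiting on the network. Untested sketch: 8 workers is a guess, and it assumes the rate limiting is per-site, so checking 800 different hosts in parallel wouldn't trip it.

from concurrent.futures import ThreadPoolExecutor

# check_url is the function from above; map preserves input order
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, (works, failed_at) in zip(urls, pool.map(check_url, urls)):
        if not works:
            print(f'broken at {failed_at}: {url}')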

Manual clicking would've taken maybe 2 hours. This took a full weekend.

Not worth it. But I have a script that works now, so whatever. Next time will be faster.
