MissingLink is a script I created to monitor a network of sites for missing links (internal or external), scripts, and images. I'd found it hard to monitor multiple sites centered around long-form content for broken links. One morning, I looked into automating the process with Python and discovered it would be a lot easier than I had expected.
It’s also just plain old fun to automate these checks using open source tools like PyLinkValidator. For me, creating MissingLink was a great way to get a little deeper into shell scripts, Python packages, and my web server, so I decided to write a little breakdown of how it works, and how you can use it yourself.
Before following any of my instructions, you’re going to want to make sure you have Python and Pip installed. In addition, if you’re on Windows you’ll want to use a terminal application like Git Bash so you can easily use the command line to interact with the script.
When you’re all set up and ready to go, you can begin by installing bartdag’s PyLinkValidator package for Python.
pip install pylinkvalidator
Once you have installed PyLinkValidator, clone the MissingLink GitHub repo.
git clone https://github.com/knightspore/missing-link.git
When the download finishes, take a look at the files and folders – purely for your own interest. You’ll want to start with an edit of the cfg/domains.txt file. Nano is a simple command-line text editor – I’ve provided alternatives below so that you can use your preferred editor to open the file.
# Edit in Terminal
nano cfg/domains.txt

# Edit with Visual Studio Code
code cfg/domains.txt

# Edit with Sublime Text
subl cfg/domains.txt

# Edit with Atom
atom cfg/domains.txt
domains.txt is the file which stores the list of domains you want to crawl with MissingLink. Simply add your target domains in the format “website.com”, one per line.
It’s also important to remember that you need to leave one blank line at the end of the file for it to work correctly – so after you’ve entered your last target domain, make sure to hit return (enter) before you save your changes.
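If you want to guard against forgetting that final blank line, a tiny helper can check and fix the file for you. This is a minimal sketch of my own (it isn't part of the MissingLink repo), shown here against a throwaway file standing in for cfg/domains.txt:

```python
# Minimal sketch: make sure a domains file ends with the trailing
# newline MissingLink expects. Not part of the repo -- just a helper.
from pathlib import Path
import tempfile, os

def ensure_trailing_newline(path):
    """Append a final newline to the file if it is missing one."""
    text = Path(path).read_text()
    if text and not text.endswith("\n"):
        Path(path).write_text(text + "\n")
    return Path(path).read_text()

# Throwaway file standing in for cfg/domains.txt:
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("parabyl.co.za\nexample.com")  # note: no trailing newline
tmp.close()

fixed = ensure_trailing_newline(tmp.name)
os.unlink(tmp.name)  # clean up the throwaway file
```

Running this on the real cfg/domains.txt before a crawl saves you from the most common "why is my last domain being skipped?" mistake.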
Now that you’re all set, execute run.sh in the main directory.
bash run.sh
Give the crawl a little while to finish (it takes longer on larger sites, naturally) and you’ll find two new folders in your directory: crawl and report.
If you don’t see these folders, or if the report is blank, there’s a chance that you’ve avoided having any broken links on your site! While this is a good thing, it does make for a particularly unspectacular demonstration of MissingLink.
Go ahead and add a link to one of your pages which leads to a 404, just so you can see some results. I usually like to use radicalturnip.com.
Note: if you’re crawling many domains and the computer you’re using isn’t very powerful, this might take a while. I like to use one of my Vultr servers to run these checks. I can spin up a new server in minutes, SSH in, and start working from there to avoid putting a strain on my mid-range laptop.
SEO health checks should be part of any maintenance routine – and broken links aren’t the best thing for Googlebot to stumble across while crawling a page. While there are plenty of plugins that can notify you when a link breaks, many of us want to keep our sites running with as few plugins as possible for performance’s sake.
Learning to do these checks manually is important – even if, in this case, that means browsing to a page and clicking on each link to see what’s broken (and why).
Understanding how to find these errors yourself will help you make better use of third-party tools like MissingLink for SEO checks: you’ll know what kinds of errors to look for, and where to refine your tools’ parameters so you don’t end up sifting through massive reports and crawl logs each week.
By now your crawl should have finished. As I mentioned before, your report-ready list of broken links will be available in the /report folder – as well as a full crawl log available in /crawl. You will probably only deal with the former, but take a look at your full crawl logs if you want a better idea of what’s happening behind the scenes.
Luckily I only had one broken link on parabyl.co.za – it’s a link to the Instagram account of a 3D artist friend of mine, who did cover art for my last album.
Broken Links for parabyl | 2020-08-04
not found (404): https://www.instagram.com/scumboy.pdf/?hl=en
Now, I can simply go and replace that link so that Scumboy can be correctly credited (and to make sure search engines don’t knock me down in the ranks for having broken links). But the question is, where does this broken link exist?
This is where the crawl logs come in handy. If I navigate to crawl/parabyl/parabyl-links.txt I can see a neat breakdown of all the pages where this broken link was found.
not found (404): https://www.instagram.com/scumboy.pdf/?hl=en
from https://parabyl.co.za
from https://parabyl.co.za/
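If you end up with more than one broken link, it helps to group each broken URL with the pages that reference it. This little sketch of mine parses log text in the shape shown above – the format is inferred from that sample, so adjust the prefixes if your crawl logs differ:

```python
# Sketch: group "not found" lines from a crawl log with the "from"
# pages that follow them. The line format is inferred from the sample
# crawl log above -- adjust the prefixes if your logs differ.

def parse_broken_links(log_text):
    """Map each broken URL to the list of pages linking to it."""
    broken = {}
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("not found"):
            # e.g. "not found (404): https://..." -> keep the URL part
            current = line.split(": ", 1)[1]
            broken[current] = []
        elif line.startswith("from ") and current:
            broken[current].append(line[len("from "):])
    return broken

sample = """not found (404): https://www.instagram.com/scumboy.pdf/?hl=en
from https://parabyl.co.za
from https://parabyl.co.za/"""
result = parse_broken_links(sample)
```

Feeding it the whole crawl/parabyl/parabyl-links.txt file instead of the sample string gives you a per-URL fix list in one pass.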
Now, I’m ready to fix the link!
Out of interest – the reason my domain appears to be “listed twice” here is that the URL I entered had no trailing slash, and as such was redirected to one with a trailing slash. While the trailing slash itself makes little difference for SEO, the fact that every URL resolves to one canonical form is a good thing – and something you can enforce using Yoast on WordPress.
We’ve reached the end of part one of this tutorial – in part two, we’ll look at bash scripting in more detail in order to understand how I set limits and instructions that can be reused to ‘standardize’ a crawl with PyLinkValidator.
This will cover some handy basic terminal commands that can make you a little faster on the command line – and, in turn, a bit more productive in your everyday work.
If I’ve piqued your interest, go ahead and check out some of my other projects – you might be able to learn something new, or even help me out with something I’m struggling with.