DEV Community

Robin Winslow

Posted on • Originally published at robinwinslow.uk


How to use unix linkchecker to thoroughly check any site

Originally posted at robinwinslow.uk/linkchecker

If you want a tool to crawl through your site looking for 404 or 500 errors, there are online tools (e.g. the W3C's online link checker), browser plugins for Firefox and Chrome, and Windows programs like Xenu's Link Sleuth.

A unix link checker

Today I found linkchecker, which is available as a unix command-line program (although it also has a GUI and a web interface).

Install the command-line tool

On Ubuntu, installing the command-line tool is simple:

sudo apt-get install linkchecker

Using linkchecker

Like any good command-line program, it has a manual page, but it can be a bit daunting to read, so I give some shortcuts below.

By default, linkchecker will give you a lot of warnings. It'll warn you about any link that results in a 301, as well as all 404s, timeouts, etc., and it also prints status updates every second or so.

Robots.txt

linkchecker will not crawl a website that is disallowed by a robots.txt file, and there's no way to override that. The solution is to change the robots.txt file to allow linkchecker through:

User-Agent: *
Disallow: /
User-Agent: LinkChecker
Allow: /
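Before starting a long crawl, it can be worth confirming that these rules really do let linkchecker in while keeping other crawlers out. Here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes linkchecker announces itself with a user agent starting with `LinkChecker`, as the rules above imply, and `SomeOtherBot` is a made-up crawler name. Python's parser may not behave exactly like linkchecker's own robots.txt handling, so treat this as a sanity check only:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from this post, one rule per line.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /
User-Agent: LinkChecker
Allow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "SomeOtherBot" is a hypothetical crawler used for comparison.
    for agent in ("LinkChecker", "SomeOtherBot"):
        print(agent, allowed(agent, "http://example.com/"))
```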

Redirecting output

linkchecker seems to expect you to redirect its output to a file. If you do, it writes only the actual warnings and errors to the file, and reports status on the command line:

$ linkchecker http://example.com > siteerrors.log
35 URLs active,     0 URLs queued, 13873 URLs checked, runtime 1 hour, 51 minutes

Timeout

If you're testing a development site, it's quite likely to be fairly slow to respond, and linkchecker may experience many timeouts, so you probably want to increase the timeout (given in seconds):

$ linkchecker --timeout=300 http://example.com > siteerrors.log

Ignore warnings

I don't know about you, but the sites I work on have loads of errors. I want to find 404s and 50*s before I worry about redirect warnings.

$ linkchecker --timeout=300 --no-warnings http://example.com > siteerrors.log

Output type

The default text output is fairly verbose. For easy readability, you probably want the logging to be in CSV format:

$ linkchecker --timeout=300 --no-warnings -ocsv http://example.com > siteerrors.csv
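Once the report is in CSV form, a few lines of Python can summarise it. This is a rough sketch, and it rests on assumptions about the file layout that you should check against your own siteerrors.csv: that fields are separated by semicolons, that lines starting with `#` are comments, and that the header row includes a `result` column holding the status text. The function names are mine, not part of linkchecker:

```python
import csv
from collections import Counter


def read_linkchecker_csv(text):
    """Return one dict per data row, skipping blank and '#' comment lines."""
    data_lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    # Assumption: semicolon-separated fields with a header row.
    return list(csv.DictReader(data_lines, delimiter=";"))


def count_by_result(rows):
    """Count rows per value of the (assumed) 'result' column."""
    return Counter(row["result"] for row in rows)


def summarise(path="siteerrors.csv"):
    """Print how often each result appears in the report at path."""
    with open(path, newline="") as report:
        rows = read_linkchecker_csv(report.read())
    for result, count in count_by_result(rows).most_common():
        print(count, result)
```

With --no-warnings switched on, every row should be an error, so the counts give a quick picture of which kinds of failure dominate.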

Other options

Once you've found and fixed all your basic 404 and 50* errors, you might want to turn warnings back on (remove --no-warnings) and start using --check-html and --check-css.

Checking websites with OpenID (2014-04-17 update)

Today I had to use linkchecker to check a site which required authentication with Canonical's OpenID system. A StackOverflow answer helped me immensely with this.

I first accessed the site as normal with Chromium, opened the console window and dumped all the cookies that were set for that site:

> document.cookie
"__utmc="111111111"; pysid=1e53e0a04bf8e953c9156ea841e41157;"

I then saved these cookies in cookies.txt in a format that linkchecker will understand:

Host:example.com
Set-cookie: __utmc="111111111"
Set-cookie: pysid="1e53e0a04bf8e953c9156ea841e41157"
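If there are more than a couple of cookies, converting them by hand gets tedious. The sketch below turns a `document.cookie` string into the same layout as the file above. `cookie_file_text` is a name I made up for illustration, and the code assumes cookies are separated by semicolons and that no cookie value itself contains a semicolon:

```python
def cookie_file_text(host, cookie_string):
    """Convert a document.cookie string into cookie-file lines for host."""
    lines = [f"Host:{host}"]
    for part in cookie_string.split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after a trailing ';'
        name, _, value = part.partition("=")
        # Drop any quotes the browser printed, then quote consistently.
        value = value.strip().strip('"')
        lines.append(f'Set-cookie: {name.strip()}="{value}"')
    return "\n".join(lines) + "\n"


def write_cookie_file(host, cookie_string, path="cookies.txt"):
    """Write the converted cookies to path."""
    with open(path, "w") as cookie_file:
        cookie_file.write(cookie_file_text(host, cookie_string))
```

Keep in mind that `document.cookie` cannot see cookies marked HttpOnly, so a session cookie may need to be copied from the browser's developer tools instead.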

And included it in my linkchecker command with --cookiefile:

linkchecker --cookiefile=cookies.txt --timeout=300 --no-warnings -ocsv http://example.com > siteerrors.csv

Use it!

If you work on a website of any significant size, there are almost certainly dozens of broken links and other errors. Link checkers will crawl through the website checking each link for errors.

Link checking your website may seem obvious, but in my experience hardly any dev teams do it regularly.

You might well want to use linkchecker to do automated link checking! I haven't implemented this yet, but I'll try to let you know when I do.
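One possible starting point, as a rough sketch rather than a tested setup: wrap the command from this post in a small Python script that a cron job or CI step could call. The function names and the default report path are my own choices, and the script assumes linkchecker is installed and on the PATH:

```python
import subprocess


def linkchecker_command(url, timeout=300, cookiefile=None):
    """Build the linkchecker command line used in this post."""
    command = ["linkchecker", f"--timeout={timeout}", "--no-warnings", "-ocsv"]
    if cookiefile:
        command.append(f"--cookiefile={cookiefile}")
    command.append(url)
    return command


def check_site(url, report_path="siteerrors.csv", **options):
    """Run linkchecker, write its report to report_path, return its exit code."""
    with open(report_path, "w") as report:
        completed = subprocess.run(
            linkchecker_command(url, **options), stdout=report
        )
    return completed.returncode
```

Calling `check_site("http://example.com")` and treating a non-zero return value as a failed check would be one way to wire this into a scheduled job. The manual page describes the exact exit codes, so check it before relying on them.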
