Problem description
Given a file of URLs, how can we filter out any that give us HTTP 404 Not Found and write the URLs that do exist to a new file?
TL;DR
cat urls.txt \
| xargs -I{} sh -c 'curl -sIL {} -w "%{http_code}" -o /dev/null \
| grep -q -v 404 && echo {}' > ok_urls.txt
Explanation
First, we pipe the list of URLs to xargs using cat:
cat urls.txt | xargs ...
Using xargs, we read the input from cat and execute a shell command for each line. The -I{} flag tells xargs to replace the string {} in the command with the current input line (in this case a URL).
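If you have not used -I{} before, here is a minimal standalone sketch (the two URLs are just placeholders) showing how each input line is substituted for {}:
printf 'https://example.com\nhttps://example.org\n' \
| xargs -I{} echo "checking: {}"
# checking: https://example.com
# checking: https://example.org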
Since we need to output the URL we received as input, we will actually use this replacement twice: first when checking the URL, and again when printing it if it turns out to be valid.
To run multiple commands for each line, we tell xargs to start a shell (sh) and pass it the commands to run with -c.
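As a small sketch of that idea on its own (the input lines here are made up), sh -c lets a single xargs invocation run a whole pipeline per line:
printf 'one\ntwo\n' | xargs -I{} sh -c 'echo "start {}" && echo "end {}"'
# start one
# end one
# start two
# end two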
In the next part of the script, we use curl to access the URL: -s tells it to be silent and suppress progress output, -I requests only the headers (a HEAD request), and -L makes it follow any redirects. To get just the status code, we use -w "%{http_code}", a flag that tailors what curl writes out, while -o /dev/null discards the rest of the response so the status code is the only thing printed.
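Taken in isolation, the curl part looks like this; the URL is only a placeholder, and the single line of output is the final status code after any redirects, for example 200:
curl -sIL https://example.com -w "%{http_code}" -o /dev/null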
To filter out 404 Not Found we can use grep -v, which matches lines that do not contain 404, and -q, which keeps grep quiet so it prints nothing. This way we only test the exit status of grep.
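Here is a quick sketch of how that exit status behaves (the status codes are hard-coded just for illustration):
echo "200" | grep -q -v 404; echo $?   # 0: a non-404 line was found
echo "404" | grep -q -v 404; echo $?   # 1: every line contained 404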
If you combine two commands with &&, the second one only runs if the first was successful - so by putting && echo {} after grep, the URL is only printed when grep succeeds. Remember that {} is replaced with the URL by xargs!
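For example, stripped down to just the && behaviour:
true  && echo "this runs"
false && echo "this never runs"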
Finally, > ok_urls.txt redirects the list of valid URLs to a new file, and we're done!
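Putting it all together, a hypothetical run could look like this (the file contents are made up, and it assumes the second URL really does return 404):
printf 'https://example.com\nhttps://example.com/no-such-page\n' > urls.txt
cat urls.txt \
| xargs -I{} sh -c 'curl -sIL {} -w "%{http_code}" -o /dev/null \
| grep -q -v 404 && echo {}' > ok_urls.txt
cat ok_urls.txt   # contains only the URLs that did not return 404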
Happy hacking,
Vetle