Problem description
Given a file of URLs, how can we filter out any that give us HTTP 404 Not Found and write the URLs that do exist to a new file?
TL;DR
cat urls.txt \
| xargs -I{} sh -c 'curl -sIL {} -w "%{http_code}" -o /dev/null \
| grep -q -v 404 && echo {}' > ok_urls.txt
Explanation
First, we pipe the list of URLs to xargs using cat:
cat urls.txt | xargs ...
Using xargs, we read the input from cat and execute a shell command for each line. The -I{} flag tells xargs to replace the string {} in the command with the current input line (in this case a URL).
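If you have not used -I{} before, here is a minimal standalone sketch (the two URLs are just placeholders) showing how each input line is substituted for {}:
printf 'https://example.com\nhttps://example.org\n' \
| xargs -I{} echo "checking: {}"
# checking: https://example.com
# checking: https://example.org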
Since we need to output the URL we received as input, we will actually use this replacement twice: first when checking the URL, and again when printing it if it turns out to be valid.
To run multiple commands for each line, we tell xargs to start a shell (sh) and pass it the commands to run with -c.
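As a small sketch of that idea on its own (the input lines here are made up), sh -c lets a single xargs invocation run a whole pipeline per line:
printf 'one\ntwo\n' | xargs -I{} sh -c 'echo "start {}" && echo "end {}"'
# start one
# end one
# start two
# end two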
In the next part of the script, we use curl to access the URL: -s tells it to be silent and suppress progress output, -I requests only the headers (a HEAD request), and -L makes it follow any redirects. To get just the status code, we use -w "%{http_code}", a flag that tailors what curl writes out, while -o /dev/null discards the rest of the response so the status code is the only thing printed.
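Taken in isolation, the curl part looks like this; the URL is only a placeholder, and the single line of output is the final status code after any redirects, for example 200:
curl -sIL https://example.com -w "%{http_code}" -o /dev/null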
To filter out 404 Not Found we can use grep -v, which matches lines that do not contain 404, and -q, which keeps grep quiet so it prints nothing. This way we only test the exit status of grep.
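Here is a quick sketch of how that exit status behaves (the status codes are hard-coded just for illustration):
echo "200" | grep -q -v 404; echo $?   # 0: a non-404 line was found
echo "404" | grep -q -v 404; echo $?   # 1: every line contained 404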
If you combine two commands with &&, the second one only runs if the first was successful - so by putting && echo {} after grep, the URL is only printed when grep succeeds. Remember that {} is replaced with the URL by xargs!
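For example, stripped down to just the && behaviour:
true  && echo "this runs"
false && echo "this never runs"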
Finally, > ok_urls.txt redirects the list of valid URLs to a new file, and we're done!
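Putting it all together, a hypothetical run could look like this (the file contents are made up, and it assumes the second URL really does return 404):
printf 'https://example.com\nhttps://example.com/no-such-page\n' > urls.txt
cat urls.txt \
| xargs -I{} sh -c 'curl -sIL {} -w "%{http_code}" -o /dev/null \
| grep -q -v 404 && echo {}' > ok_urls.txt
cat ok_urls.txt   # contains only the URLs that did not return 404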
Happy hacking,
Vetle