Mohan Ganesan

Posted on • Originally published at proxiesapi.com

Building a Web Crawler? Here Are All The Places That It Will Probably Fail At

Here is a list of places where your web crawler will probably fail. You will need to build in checks for each and expect them to happen. Send yourself alerts by having portions of your scripts check for unexpected behavior.

If your web crawler is stuck, you need to know
If your web crawler is slowing down, you need to know
If you are having internet issues, you need to know
If the data you are getting is weird, you need to know
You can also use external tools like these to help keep them running.
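One common pattern with external monitors is a "dead man's switch": the crawler pings a check-in URL after every successful run, and the monitoring service alerts you if the pings stop arriving. A minimal sketch, assuming a hypothetical check-in URL from whatever monitoring service you use:

```python
import requests

# Hypothetical check-in URL provided by your external monitoring service.
HEARTBEAT_URL = "https://example-monitor.com/ping/your-crawler-id"

def report_heartbeat():
    """Tell the external monitor the crawler is still alive.
    If these pings stop arriving, the monitor alerts you."""
    try:
        requests.get(HEARTBEAT_URL, timeout=10)
    except requests.RequestException:
        # A failed heartbeat should never crash the crawler itself.
        pass

# Call report_heartbeat() at the end of each successful crawl cycle.
```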

Log all the steps your web crawler is taking and the time each one took. Build in a check where your code sends you an alert when a step takes too long, or when it 'knows' what data should have been fetched but that data is missing this time.
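A minimal sketch of that idea, assuming a threshold you tune yourself and a hypothetical send_alert() hook wired to whatever channel you use (email, Slack, etc.):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crawler")

MAX_STEP_SECONDS = 30  # assumed threshold; tune to your crawler

def send_alert(message):
    # Hypothetical hook: wire this to email, Slack, PagerDuty, etc.
    logger.error("ALERT: %s", message)

def timed_step(name, func, *args, **kwargs):
    """Run one crawl step, log how long it took, and alert if it was
    too slow or returned nothing when data was expected."""
    start = time.time()
    result = func(*args, **kwargs)
    elapsed = time.time() - start
    logger.info("step=%s took=%.2fs", name, elapsed)

    if elapsed > MAX_STEP_SECONDS:
        send_alert(f"Step '{name}' took {elapsed:.1f}s (limit {MAX_STEP_SECONDS}s)")
    if not result:
        send_alert(f"Step '{name}' returned no data")
    return result
```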

Here is a list of places in your code that you need to pay special attention to in order to prevent breakages (see the sketch after this list):

When the web pages don't load
Internet is down
When the content at the URL has moved
You are shown a CAPTCHA challenge.
The web page changes its HTML, so your scraping doesn't work.
Some fields that you scrape are empty some of the time, and there is no handler for that.
The web pages take a long time to load
The web site has blocked you completely
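Here is one way those checks could look in practice. This is a sketch, not a complete crawler: it assumes the requests and BeautifulSoup libraries, reuses the hypothetical send_alert() helper from the sketch above, and the CAPTCHA/block detection and CSS selectors are stand-ins you would adapt to the sites you actually crawl.

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(url, retries=3):
    """Fetch a URL with checks for the failure modes listed above."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=20)  # pages that take too long
        except requests.Timeout:
            send_alert(f"Timeout fetching {url}")
            continue
        except requests.ConnectionError:
            send_alert(f"Connection failed for {url} (internet down?)")
            continue

        if resp.history:  # the request was redirected
            send_alert(f"Content may have moved: {url} -> {resp.url}")
        if resp.status_code in (404, 410):
            send_alert(f"Content gone: {url} returned {resp.status_code}")
            return None
        if resp.status_code in (403, 429):
            send_alert(f"Possibly blocked on {url} (status {resp.status_code})")
            return None
        if "captcha" in resp.text.lower():  # crude check for a CAPTCHA challenge
            send_alert(f"CAPTCHA challenge shown on {url}")
            return None
        return resp.text
    return None

def parse_item(html):
    """Parse fields defensively so empty fields or changed HTML don't crash the run."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.select_one("h1.title")    # hypothetical selector
    price_tag = soup.select_one("span.price")  # hypothetical selector

    if title_tag is None:
        send_alert("HTML structure may have changed: title selector found nothing")
    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "price": price_tag.get_text(strip=True) if price_tag else None,
    }
```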
