Web scraping projects are notorious for failing. So we thought it more appropriate to write a list of DON'Ts rather than a list of DOs. So here goes.
DON'T assume external dependencies will behave. If the crawler depends on any external data or event happening in a particular way, don't assume it will; it will go wrong MORE often than not. For example, fetching a URL can break because of timeouts, redirects, CAPTCHA challenges, IP blocks, and so on.
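Here is a minimal sketch of defensive fetching with the requests library; the URL, timeout, and retry counts are illustrative placeholders, not from the original post.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient failures (429s, 5xx) with exponential backoff.
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount("http://", HTTPAdapter(max_retries=retries))
session.mount("https://", HTTPAdapter(max_retries=retries))

try:
    resp = session.get("https://example.com/page", timeout=10)
    resp.raise_for_status()
except requests.RequestException as exc:
    # Log and move on instead of letting one bad URL kill the crawl.
    print(f"Fetch failed: {exc}")
```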
DON'T build custom crawling code from scratch. Use a framework like Scrapy.
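For a sense of how little code a framework needs, here is a minimal Scrapy spider against Scrapy's own practice site; the spider name and CSS selectors are illustrative.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Scrapy handles scheduling, retries, and deduplication for you.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```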
DON'T be too aggressive with a website. Check its response time first. In fact, at crawltohell.com, our crawlers adjust their concurrency based on the response time of each domain, so we don't overburden their servers.
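If you use Scrapy, its built-in AutoThrottle extension implements a similar idea, adjusting delay and concurrency from observed latency. The values below are illustrative defaults, not our production settings.

```python
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60.0          # cap the delay when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # avg parallel requests per domain
AUTOTHROTTLE_DEBUG = False             # set True to log throttling decisions
```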
DON'T write linear code. Don't write code that crawls, scrapes, processes, and stores data in one linear process. If one step breaks, so do the others, and you also won't be able to measure and optimize the performance of each step independently. Batch the stages instead.
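One way to batch is to hand work between stages through the filesystem, so each stage can fail, be re-run, and be timed on its own. This is a sketch under assumed names; the paths and the trivial record extraction are placeholders for your real logic.

```python
import json
import pathlib
import requests

RAW_DIR = pathlib.Path("raw_html")      # stage 1 output
OUT_FILE = pathlib.Path("items.jsonl")  # stage 2 output

def crawl_stage(urls):
    """Stage 1: fetch pages and persist raw HTML. Nothing else."""
    RAW_DIR.mkdir(exist_ok=True)
    for i, url in enumerate(urls):
        try:
            html = requests.get(url, timeout=10).text
            (RAW_DIR / f"{i}.html").write_text(html)
        except requests.RequestException:
            continue  # one bad fetch doesn't block parsing or storage

def parse_stage():
    """Stage 2: parse saved HTML into records; re-runnable independently."""
    with OUT_FILE.open("a") as out:
        for path in RAW_DIR.glob("*.html"):
            # Placeholder extraction; swap in your real parser here.
            record = {"file": path.name, "length": len(path.read_text())}
            out.write(json.dumps(record) + "\n")
```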
DON'T depend on your own IPs. They will eventually get blocked. Always build in the ability to route your requests through a rotating proxy service like Proxies API.
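As a minimal sketch, routing requests through a rotating proxy gateway can look like the following; the endpoint and credentials are placeholders, so check your provider's documentation (e.g. Proxies API) for the actual format.

```python
import requests

# Placeholder gateway URL; your provider supplies the real host and auth.
PROXY = "http://USER:PASS@rotating-proxy.example.com:8080"

resp = requests.get(
    "https://example.com/page",
    proxies={"http": PROXY, "https": PROXY},  # each request exits from a fresh IP
    timeout=10,
)
print(resp.status_code)
```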