Will This Code Work? What's Wrong With Most Web Scraping Code
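
Consider a small script along these lines. This is a minimal sketch: the search URL and the CSS selectors are illustrative stand-ins, since Airbnb's real pages are largely JavaScript-rendered and their markup changes often:

```python
# scrapeAirbnb.py - a naive first pass at scraping listings.
# The search URL and the CSS selectors below are hypothetical
# placeholders; Airbnb's actual markup differs and changes often.
import requests
from bs4 import BeautifulSoup

url = "https://www.airbnb.com/s/New-York/homes"  # illustrative URL

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for card in soup.select("div.listing-card"):       # hypothetical selector
    title = card.select_one("span.listing-title")  # hypothetical selector
    if title:
        print(title.get_text(strip=True))
```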

This code scrapes Airbnb listings and prints them out.
When you save it as scrapeAirbnb.py and run it, it will get you the details.
So is the job done? Finito?

Far from it. If this needs to go into production at any decent level of scale, you will need all sorts of mechanisms to make sure it can keep functioning without breaking.

You will need to handle website timeouts and retry failed requests (see the first sketch after this list)
You will probably need to download images
You will need to convincingly pretend to be a web browser, using realistic User-Agent strings and other techniques
You will need to rotate those User-Agent strings across requests
You will need to read robots.txt and respect it
You will need to send asynchronous requests if you have a lot of URLs to scrape (see the second sketch below)
You may need distributed servers to handle the load if multiple domains have to be crawled asynchronously
You will need monitoring, tracking, and alerting mechanisms for when the crawler breaks for any reason
You will need to handle incoming data in large quantities, detect when a job has finished, send out alerts, and make the data available for download or further consumption in formats like XML, CSV, or JSON (see the third sketch below)
You may need to handle cookies that the web server sends
You will need to handle CAPTCHAs and other restrictions that websites impose after you crawl a few hundred URLs
You will need to handle outright IP bans
The list goes on. Web crawling is surprisingly complex and can be frustrating, especially in the beginning.
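
To make the first few items concrete, here is a rough sketch of a politer, more resilient fetch function: a timeout on every request, simple retries with exponential backoff, a rotated User-Agent header, and a robots.txt check. The User-Agent values are sample strings, and a real crawler would keep a much larger, regularly refreshed pool:

```python
import random
import time
from urllib import robotparser
from urllib.parse import urlsplit

import requests

# Sample User-Agent strings to rotate through; a production crawler
# would maintain a much larger, up-to-date pool.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

session = requests.Session()  # a Session also keeps any cookies the server sets

def allowed_by_robots(url):
    """Check the site's robots.txt before fetching a URL.

    For brevity this re-reads robots.txt on every call; a real
    crawler would cache the parsed file per domain.
    """
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch("*", url)

def fetch(url, retries=3, timeout=10):
    """Fetch a URL with a timeout, retries with backoff, and a rotated User-Agent."""
    if not allowed_by_robots(url):
        return None
    for attempt in range(retries):
        try:
            response = session.get(
                url,
                headers={"User-Agent": random.choice(USER_AGENTS)},
                timeout=timeout,
            )
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off before retrying
    return None
```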
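For the asynchronous piece, here is a second sketch using asyncio with the aiohttp library (an assumption; the script above uses requests), with a semaphore to cap the number of requests in flight at once:

```python
import asyncio

import aiohttp

async def fetch_all(urls, concurrency=10):
    """Fetch many URLs concurrently, capping how many are in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch_one(session, url):
        async with semaphore:
            try:
                async with session.get(url) as response:
                    return url, await response.text()
            except (aiohttp.ClientError, asyncio.TimeoutError):
                return url, None  # failures surface as None for retries or alerting

    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(fetch_one(session, u) for u in urls))

# pages = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
```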
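And for handing the results off once a job finishes, a third sketch of a minimal export step, assuming the scraped records are collected as a list of dicts with uniform keys, written out as both JSON and CSV:

```python
import csv
import json

def export_results(listings, basename="airbnb_listings"):
    """Write scraped records to JSON and CSV for download or further processing."""
    with open(f"{basename}.json", "w", encoding="utf-8") as f:
        json.dump(listings, f, ensure_ascii=False, indent=2)

    if listings:
        with open(f"{basename}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(listings[0]))
            writer.writeheader()
            writer.writerows(listings)

# export_results([{"title": "Cozy loft", "price": "$120"}])
```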

It can be extremely rewarding once you finally have a reliable, schedulable, and manageable crawler/scraper setup with all of the above in place.

Use this as a checklist in your future web crawling projects, and comment below if you have other items to add.

If you want cloud-based crawling software that can do all of that and more behind the scenes in a reliable fashion, consider using our product TeraCrawler.io for crawling large sets of URLs. To overcome IP bans, I recommend our other product, Proxies API, a rotating proxies API that can route your requests through a pool of over 2 million IPs, making IP bans almost impossible.
