Wake Up. Your Web Crawler is Down

If you have ever written a web crawler, you know that it is one of the most bafflingly difficult programs to write. And as beginners, we are almost guaranteed to make several mistakes in the process of building one.

Initially, we think building a crawler is about building the code. The code works on an example website. It's able to crawl, scrape, and store data. What could go wrong?

Well, it turns out, many things.

I remember when I wrote my first crawler. It was in PHP. It used cURL requests to download pages, scraped them with regex, followed the pagination, and stored the data in MySQL.

And once we deployed it, this became the theme of my life.

"Wake up. The web crawler is down."

And it was always something new.

The way I had coded it, if the code broke while crawling one page, the whole process broke down. The rest of the crawl would stop, and so would the scraping. I had no way of knowing how many URLs I had finished crawling, whether they had been fetched successfully, or whether they had been scraped successfully. I had no way to resume where I left off.

I had not heard of robots.txt. I didn't know that I could use asynchronous requests to download URLs concurrently. I had set no rules about not following external links. Once I did manage to write the code for that, it would not fetch the CDN images because they were served from a slightly different address.

My code was so complex, and I was at my wit's end. So I would hard-code many things particular to a website into the code. There was a separate project for each website I had to scrape.
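To make the first few of those gaps concrete, here is a minimal sketch in Python (not the original PHP) of a crawl loop that records a status for every URL and keeps going when one page fails. The table layout, the status names, and the `fetch` and `scrape` callables are all assumptions made up for this illustration.

```python
import sqlite3

def open_state(path=":memory:"):
    """Open (or create) the crawl state. Use a file path to survive restarts."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "url TEXT PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'pending', "
        "error TEXT)"
    )
    return db

def enqueue(db, url):
    """Add a URL once; URLs already seen keep their current status."""
    db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))

def mark(db, url, status, error=None):
    """Record the outcome for one URL and persist it immediately."""
    db.execute(
        "UPDATE urls SET status = ?, error = ? WHERE url = ?",
        (status, str(error) if error else None, url),
    )
    db.commit()

def crawl(db, fetch, scrape):
    """Process every pending URL; a failure on one page never stops the rest.

    fetch(url) returns the page body; scrape(url, body) returns new links.
    Both are hypothetical callables supplied by the caller.
    """
    while True:
        row = db.execute(
            "SELECT url FROM urls WHERE status = 'pending' LIMIT 1"
        ).fetchone()
        if row is None:
            break
        url = row[0]
        try:
            body = fetch(url)
        except Exception as exc:
            mark(db, url, "fetch_failed", exc)
            continue
        try:
            for link in scrape(url, body):
                enqueue(db, link)
        except Exception as exc:
            mark(db, url, "scrape_failed", exc)
            continue
        mark(db, url, "done")
```

Because every outcome is written to the table as it happens, pointing `open_state` at a file and calling `crawl` again after a crash picks up the remaining pending URLs, and a simple `SELECT status, COUNT(*) FROM urls GROUP BY status` answers the question of how many pages were fetched and scraped.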

I didn't know that there were frameworks I could use where most of the heavy lifting was already done. The code was working fine in my little setup. But out there in the wild, it failed at the drop of a hat.
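On the point about heavy lifting already being done: even without a full framework, Python's standard library covers two of the gaps mentioned earlier, robots.txt and external links. The sketch below is only an illustration; the `make_link_filter` helper and the `example-crawler` user agent are hypothetical names, and the robots.txt content is passed in as lines so the example does not touch the network.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def make_link_filter(start_url, robots_lines, user_agent="example-crawler"):
    """Build a function that decides whether a discovered link may be crawled."""
    host = urlparse(start_url).netloc
    robots = RobotFileParser()
    robots.parse(robots_lines)  # lines of an already downloaded robots.txt

    def allowed(base_url, href):
        """Return the absolute URL if it is in scope, otherwise None."""
        url = urljoin(base_url, href)  # resolve relative links
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            return None  # skip mailto:, javascript:, and similar
        if parts.netloc != host:
            return None  # external link: do not follow
        if not robots.can_fetch(user_agent, url):
            return None  # disallowed by robots.txt
        return url

    return allowed
```

The strict host comparison is exactly the kind of rule that would also reject images on a CDN host, so a real crawler would probably keep a separate, looser allow-list for assets.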

The website was unreachable. My crawler would break.

The website changed its HTML patterns. My regex would break.

The website threw up a CAPTCHA challenge. My crawler would have a meltdown.

Websites would block me all the time, and each time I would restart my router to get a new IP address and reconnect.
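The first of these failures, an unreachable website, is the easiest to soften. One common approach, sketched here in Python as an illustration rather than what my crawler did, is to retry with a growing pause and then give up quietly instead of crashing. The `fetch` callable is hypothetical and is assumed to raise `OSError` (or a subclass such as `ConnectionError` or `TimeoutError`) when the site cannot be reached; the defaults are arbitrary.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on network errors with exponential backoff.

    Returns the page body, or None if every attempt failed, so the caller
    can record the failure and move on to the next URL.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            # Pause before the next try: base_delay, then 2x, then 4x, ...
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None
```

Passing `sleep` in as a parameter is only there to make the function easy to test without real waiting. Retries do nothing for CAPTCHAs or IP blocks, which need a different kind of fix.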

Writing a web crawler is one of the most fun jobs in programming but also one of the most difficult.

Eventually, all these frustrations led to a lot of learning, and we developed tools like Proxies API and crawltohell to help people overcome as many of these problems as simply and as easily as possible.