<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: TeraCrawler</title>
    <description>The latest articles on DEV Community by TeraCrawler (@crawlertera).</description>
    <link>https://dev.to/crawlertera</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F443066%2F706890f5-e70a-4626-b068-dea141c2d58b.png</url>
      <title>DEV Community: TeraCrawler</title>
      <link>https://dev.to/crawlertera</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/crawlertera"/>
    <language>en</language>
    <item>
      <title>Create Web Crawlers That Don't Die on You</title>
      <dc:creator>TeraCrawler</dc:creator>
      <pubDate>Mon, 02 Nov 2020 17:24:25 +0000</pubDate>
      <link>https://dev.to/crawlertera/create-web-crawlers-that-don-t-die-on-you-4aeo</link>
      <guid>https://dev.to/crawlertera/create-web-crawlers-that-don-t-die-on-you-4aeo</guid>
      <description>&lt;p&gt;Here are some rules of thumb to follow when building web crawlers that can scale.&lt;/p&gt;

&lt;p&gt;Decouple the web crawling and the web scraping processes, so you can measure and speed up the performance of each separately.&lt;br&gt;
Do not use regex for scraping. Use XPath or CSS selectors instead. Regex is the first thing to break when the target page's HTML changes even a little.&lt;br&gt;
Assume every external dependent process or method will fail, and write handlers and loggers for each. For example, assume the URL fetch will fail, time out, redirect, return empty, or show a CAPTCHA. Anticipate and log each of these exceptions.&lt;br&gt;
Make your app easy to debug by tracing every step your crawler goes through. Make the logger as rich as possible, and send yourself alerts so you know immediately when something is wrong.&lt;br&gt;
Learn how to get your crawlers to pretend to be human.&lt;br&gt;
Build your crawler around a framework. Custom code has many more points of failure.&lt;/p&gt;
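&lt;p&gt;As a sketch of the "assume everything fails" rule, here is one way to wrap a fetch call so that each failure mode is logged and classified instead of killing the crawl. The wrapper, its labels, and the stub fetcher are our own illustration, not any library's API:&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler")

def safe_fetch(fetcher, url):
    """Run fetcher(url); return a (status, body) pair and log every failure."""
    try:
        body = fetcher(url)
    except TimeoutError:
        log.warning("timeout fetching %s", url)
        return ("timeout", None)
    except Exception as exc:
        log.warning("error fetching %s: %s", url, exc)
        return ("error", None)
    if not body:
        log.warning("empty response from %s", url)
        return ("empty", None)
    if "captcha" in body.lower():
        log.warning("CAPTCHA served at %s", url)
        return ("captcha", None)
    return ("ok", body)

# stub fetcher for demonstration -- a real one would use requests or urllib
def flaky(url):
    raise TimeoutError("connect timed out")

print(safe_fetch(flaky, "https://example.com"))  # ('timeout', None)
```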

</description>
    </item>
    <item>
      <title>Do you make these 9 mistakes while web crawling?</title>
      <dc:creator>TeraCrawler</dc:creator>
      <pubDate>Mon, 02 Nov 2020 17:23:16 +0000</pubDate>
      <link>https://dev.to/crawlertera/do-you-make-these-9-mistakes-while-web-crawling-5b4e</link>
      <guid>https://dev.to/crawlertera/do-you-make-these-9-mistakes-while-web-crawling-5b4e</guid>
      <description>&lt;p&gt;Here are a bunch of things that can get you in trouble while web crawling:&lt;/p&gt;

&lt;p&gt;Not respecting Robots.txt.&lt;br&gt;
Not using asynchronous connections to speed up crawling.&lt;br&gt;
Not using CSS selectors or XPath to reliably scrape data.&lt;br&gt;
Not using a user-agent string.&lt;br&gt;
Not rotating user-agent strings.&lt;br&gt;
Not adding a random delay between requests to the same domain.&lt;br&gt;
Not using a framework.&lt;br&gt;
Not monitoring the progress of your crawlers.&lt;br&gt;
Not using a rotating proxy service like Proxies API.&lt;br&gt;
Being smart about web crawling means realizing that it's not about the code. In our experience at TeraCrawler developing cloud-based web crawlers at scale, most of web crawling and web scraping is about controlling these variables. A systematic approach that gets you frequent, reliable data at scale, day in and day out, can change the fortunes of your company.&lt;/p&gt;
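&lt;p&gt;To make the user-agent and delay points concrete, here is a minimal sketch of rotating user-agent strings and pausing randomly between requests. The UA strings and delay bounds are arbitrary placeholders:&lt;/p&gt;

```python
import random
import time

# illustrative only -- real rotations use a much larger, current UA pool
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0",
]

def polite_headers():
    """Pick a fresh User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_sleep(lo=1.0, hi=3.0):
    """Random delay between requests to the same domain."""
    delay = random.uniform(lo, hi)
    time.sleep(delay)
    return delay

print(polite_headers()["User-Agent"])
```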

</description>
    </item>
    <item>
      <title>Fix Your Web Scrapers with This 15 Point Checklist</title>
      <dc:creator>TeraCrawler</dc:creator>
      <pubDate>Tue, 27 Oct 2020 13:09:46 +0000</pubDate>
      <link>https://dev.to/crawlertera/fix-your-web-scrapers-with-this-15-point-checklist-2jgo</link>
      <guid>https://dev.to/crawlertera/fix-your-web-scrapers-with-this-15-point-checklist-2jgo</guid>
      <description>&lt;p&gt;Web scrapers are known to die on us. That's because so much depends on things on the internet we can't control. We at Proxies API always say: if you want to understand the internet, build a web crawler.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Lead Generation Through Web Scraping</title>
      <dc:creator>TeraCrawler</dc:creator>
      <pubDate>Wed, 21 Oct 2020 10:30:33 +0000</pubDate>
      <link>https://dev.to/crawlertera/lead-generation-through-web-scraping-1i7m</link>
      <guid>https://dev.to/crawlertera/lead-generation-through-web-scraping-1i7m</guid>
      <description>&lt;p&gt;Web scraping is one of the best ways to generate leads for your business quite easily.&lt;/p&gt;

&lt;p&gt;Today we will look at a practical example of how to get leads from online business directories.&lt;/p&gt;

&lt;p&gt;Here are 57 of them that you can scrape leads from.&lt;/p&gt;

&lt;p&gt;That's a lot. So in this article, let's learn how to scrape one of them, Yellow Pages, so that you can use the same techniques to scrape data from the others. Here we are imagining a scenario where we want to generate a list of dentists to target.&lt;/p&gt;

&lt;p&gt;BeautifulSoup will help us extract information, and we will retrieve crucial pieces of information from Yellow Pages.&lt;/p&gt;

&lt;p&gt;To start with, this is the boilerplate code we need to get the Yellowpages.com search results page and set up BeautifulSoup to help us use CSS selectors to query the page for meaningful data.&lt;/p&gt;

&lt;p&gt;We are also passing the user agent headers to simulate a browser call, so we don't get blocked.&lt;/p&gt;

&lt;p&gt;Now let's analyse the Yellow pages search results. This is how it looks.&lt;/p&gt;

&lt;p&gt;And when we inspect the page, we find that each item's HTML is encapsulated in a tag with the class v-card.&lt;/p&gt;

&lt;p&gt;We could just use this to break the HTML document into these cards which contain individual item information like this.&lt;/p&gt;

&lt;p&gt;And when you run it.&lt;/p&gt;

&lt;p&gt;You can tell that the code is isolating the v-cards' HTML.&lt;/p&gt;

&lt;p&gt;On further inspection, you can see that the name of the place always has the class business-name. So let's try and retrieve that.&lt;/p&gt;

&lt;p&gt;That will get us the names.&lt;/p&gt;

&lt;p&gt;Bingo!&lt;/p&gt;

&lt;p&gt;Now let's get the other data pieces.&lt;/p&gt;

&lt;p&gt;And when run.&lt;/p&gt;

&lt;p&gt;Produces all the info we need, including ratings, reviews, phone, address, etc.&lt;/p&gt;
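&lt;p&gt;The selector logic described above can be sketched on an inline sample like this. The v-card and business-name class names come from the article; the other class names, the sample markup, and the business data are invented for illustration, and square brackets stand in for angle brackets in the string until they are swapped back:&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# tiny stand-in for a Yellow Pages results page; [ and ] become angle brackets
SAMPLE = """
[div class="v-card"]
  [a class="business-name"]Smile Dental Care[/a]
  [div class="ratings"]4.5[/div]
  [div class="phones"](202) 555-0143[/div]
  [div class="adr"]12 Main St, Washington, DC[/div]
[/div]
""".replace("[", chr(60)).replace("]", chr(62))

soup = BeautifulSoup(SAMPLE, "html.parser")
for card in soup.select("div.v-card"):
    name = card.select_one("a.business-name").get_text(strip=True)
    rating = card.select_one("div.ratings").get_text(strip=True)
    phone = card.select_one("div.phones").get_text(strip=True)
    address = card.select_one("div.adr").get_text(strip=True)
    print(name, rating, phone, address)
```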

&lt;p&gt;If you want to use this in production and want to scale to thousands of links, then you will find that you will get IP blocked easily by Yellow Pages. In this scenario, using a rotating proxy service to rotate IPs is almost a must. You can use a service like Proxies API to route your calls through a pool of millions of residential proxies.&lt;/p&gt;

&lt;p&gt;If you want to scale the crawling speed and don't want to set up your own infrastructure, you can use our cloud-based crawler crawltohell.com to easily crawl thousands of URLs at high speed from our network of crawlers.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Answer to Last Week's Web Scraping Challenge</title>
      <dc:creator>TeraCrawler</dc:creator>
      <pubDate>Sat, 17 Oct 2020 04:57:29 +0000</pubDate>
      <link>https://dev.to/crawlertera/answer-to-last-week-s-web-scraping-challenge-7d2</link>
      <guid>https://dev.to/crawlertera/answer-to-last-week-s-web-scraping-challenge-7d2</guid>
      <description>&lt;p&gt;We at TeraCrawler.io do a lot of web crawling and web scraping. We write a lot about it here as well as on our blog. Every day is a new challenge. When a new developer joins our team, we throw a few challenges at them to test their ability to think through the quagmire that is web scraping.&lt;/p&gt;

&lt;p&gt;So last week we posted a web scraping coding challenge to see if some of you wanted to test yourself against a real-world web scraping problem. Here is our answer to that problem in a step by step manner:&lt;/p&gt;

&lt;p&gt;So in the challenge, we had to scrape Flipkart data using Python and BeautifulSoup in a simple and elegant manner.&lt;/p&gt;

&lt;p&gt;So the first thing we need is to make sure we have Python 3 installed. If not, you can just get Python 3 and get it installed before you proceed.&lt;/p&gt;

&lt;p&gt;Then you can install beautiful soup with:&lt;/p&gt;

&lt;p&gt;We will also need the libraries requests, lxml, and soupsieve to fetch data, parse the HTML, and use CSS selectors. Install them using...&lt;/p&gt;

&lt;p&gt;Once installed open an editor and type in:&lt;/p&gt;
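&lt;p&gt;The boilerplate would look something like the sketch below. The search URL and User-Agent value are placeholders, and lxml needs to be installed for BeautifulSoup to use it as the parser:&lt;/p&gt;

```python
import requests
from bs4 import BeautifulSoup

# a desktop-browser User-Agent so the request looks like a normal visit
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
}

def fetch(url):
    """Fetch a page while pretending to be a browser; return a soup object."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    return BeautifulSoup(response.text, "lxml")

# example (hits the network, so it is not run here):
# soup = fetch("https://www.flipkart.com/search?q=phones")
```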

&lt;p&gt;Now let's go to the Flipkart listing page and inspect the data we can get.&lt;/p&gt;

&lt;p&gt;This is how it looks:&lt;/p&gt;

&lt;p&gt;Back to our code now. Let's try and get this data by pretending we are a browser like this:&lt;/p&gt;

&lt;p&gt;Save this as scrapeFlipkart.py.&lt;/p&gt;

&lt;p&gt;If you run it:&lt;/p&gt;

&lt;p&gt;You will see the whole HTML page.&lt;/p&gt;

&lt;p&gt;Now, let's use CSS selectors to get to the data we want. To do that let's go back to Chrome and open the inspect tool.&lt;/p&gt;

&lt;p&gt;We notice that all the individual product data is contained in a tag with the attribute data-id. You also notice that the attribute value is some gibberish and it's always changing. We can't use the value. But the clue is the presence of the data-id attribute itself. That's all we need. So let's extract that.&lt;/p&gt;

&lt;p&gt;This prints all the content in each of the containers that hold the product data.&lt;/p&gt;

&lt;p&gt;The second line above gives us the URL to the listing.&lt;/p&gt;

&lt;p&gt;The product rating has a meaningful id, productRating, followed by some gibberish. But we can use the *= operator to select anything whose id contains the word productRating.&lt;/p&gt;
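&lt;p&gt;A small self-contained illustration of that *= trick, with made-up ids in the Flipkart style (square brackets stand in for angle brackets and are swapped back before parsing):&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# sample markup mimicking Flipkart's auto-generated ids
SAMPLE = """
[div id="productRating_XY12_AB34"]4.3[/div]
[div id="productRating_QW56_CD78"]4.6[/div]
""".replace("[", chr(60)).replace("]", chr(62))

soup = BeautifulSoup(SAMPLE, "html.parser")
# [id*=productRating] matches any id that merely CONTAINS the word
ratings = [div.get_text(strip=True) for div in soup.select("div[id*=productRating]")]
print(ratings)  # ['4.3', '4.6']
```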

&lt;p&gt;Scraping the price is even more challenging because it has no discernible class name or ID as a clue to get to it. But it always has the currency symbol ₹ in it. So we use regex to find it.&lt;/p&gt;

&lt;p&gt;We do the same to get the discount percentage. It always has the word off in it.&lt;/p&gt;
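&lt;p&gt;Both regex tricks can be sketched on a made-up card text like this:&lt;/p&gt;

```python
import re

# invented sample text; the real page text is similar in shape
card_text = "Apple iPhone XR (64 GB) 4.6 stars ₹41,999 ₹47,900 12% off"

# the price always carries the rupee sign, so match the sign plus digits/commas
prices = re.findall(r"₹[\d,]+", card_text)
# the discount is the percentage right before the word 'off'
discount = re.search(r"(\d+%)\s+off", card_text)
print(prices)            # current and struck-out original price
print(discount.group(1))
```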

&lt;p&gt;Putting it all together.&lt;/p&gt;

&lt;p&gt;If you run it it will print out all the details.&lt;/p&gt;

&lt;p&gt;Bingo!! We got them all. That was challenging and satisfying.&lt;/p&gt;

&lt;p&gt;If you want to use this in production and want to scale to thousands of links, then you will find that you will get IP blocked easily by Flipkart. In this scenario, using a rotating proxy service to rotate IPs is almost a must. You can use a service like Proxies API to route your calls through a pool of millions of residential proxies.&lt;/p&gt;

&lt;p&gt;If you want to scale the crawling speed and don't want to set up your own infrastructure, you can use our cloud-based crawler teracrawler.io to easily crawl thousands of URLs at high speed from our network of crawlers.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>New to web scraping? Here is a challenge for you</title>
      <dc:creator>TeraCrawler</dc:creator>
      <pubDate>Wed, 14 Oct 2020 17:36:21 +0000</pubDate>
      <link>https://dev.to/crawlertera/new-to-web-scraping-here-is-a-challenge-for-you-3f3j</link>
      <guid>https://dev.to/crawlertera/new-to-web-scraping-here-is-a-challenge-for-you-3f3j</guid>
      <description>&lt;p&gt;We at TeraCrawler.io do a lot of web crawling and web scraping. We write a lot about it here as well as on our blog. Every day is a new challenge. When a new developer joins our team, we throw a few challenges at them to test their ability to think through the quagmire that is web scraping.&lt;/p&gt;

&lt;p&gt;We thought we would share one such challenge here so you can have fun with it. It will make you rack your brains, force you to research and think outside the box, and in the end you will know five new things about scraping that you didn't before. Please post your answers as comments so everyone can see the various ways to approach this problem. Don't worry, we will post our answer in a week's time as a follow-up post.&lt;/p&gt;

&lt;p&gt;The scenario:&lt;/p&gt;

&lt;p&gt;Suppose you are a phone retailer who wants to keep tabs on all the phones on Flipkart, the eCommerce website in India (there is a reason we picked Flipkart and not a global website like Amazon, as you will see): their prices, ratings, and especially the percentage discounts they are offering.&lt;/p&gt;

&lt;p&gt;What you have to do:&lt;/p&gt;

&lt;p&gt;You will have to crawl product data from Flipkart.com for the product category phones and extract the following details:&lt;/p&gt;

&lt;p&gt;Title of the product, URL, Rating, Price, and discount percentage.&lt;/p&gt;

&lt;p&gt;What's challenging about this:&lt;/p&gt;

&lt;p&gt;As you will discover, Flipkart happens to have no obvious way to point at the data we want. Generally, websites like Amazon are easily scrapable because the HTML and the CSS classes are meaningfully laid out. You can use CSS selectors to point at data and simply scrape it. None of that will fly with Flipkart because, as shown below, the CSS classes are all auto-generated gibberish. Also, from having scraped the website before, we know you will need to jump through different hoops for different pieces of data. So have at it:&lt;/p&gt;

&lt;p&gt;What you will learn&lt;/p&gt;

&lt;p&gt;You will learn how to use creative ways to reliably point at data you want. A sort of web scraping parkour if you will. You will fail, fail, fail and then you will learn.&lt;/p&gt;

&lt;p&gt;Tools you can use&lt;/p&gt;

&lt;p&gt;Ideally, you can use Python and Beautiful Soup to do the scraping. We are not strict about the language or the library as long as it gets the job done, but our answer a week from now will use Python and Beautiful Soup.&lt;/p&gt;

&lt;p&gt;Enjoy the hunt. See you a week later :-)&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping Cars.com Product Details with Python and Beautiful Soup</title>
      <dc:creator>TeraCrawler</dc:creator>
      <pubDate>Fri, 09 Oct 2020 05:43:26 +0000</pubDate>
      <link>https://dev.to/crawlertera/scraping-cars-com-product-details-with-python-and-beautiful-soup-3oel</link>
      <guid>https://dev.to/crawlertera/scraping-cars-com-product-details-with-python-and-beautiful-soup-3oel</guid>
      <description>&lt;p&gt;Today we are going to see how we can scrape Cars.com product details using Python and BeautifulSoup in a simple and elegant manner.&lt;/p&gt;

&lt;p&gt;The aim of this article is to get you started on real-world problem solving while keeping it super simple, so you get familiar and get practical results as fast as possible.&lt;/p&gt;

&lt;p&gt;So the first thing we need is to make sure we have Python 3 installed. If not, you can just get Python 3 and get it installed before you proceed.&lt;/p&gt;

&lt;p&gt;Then you can install beautiful soup with:&lt;/p&gt;

&lt;p&gt;We will also need the libraries requests, lxml, and soupsieve to fetch data, parse the HTML, and use CSS selectors. Install them using:&lt;/p&gt;

&lt;p&gt;Once installed open an editor and type in:&lt;/p&gt;

&lt;p&gt;Now let's go to the Cars.com product details listing page and inspect the data we can get.&lt;/p&gt;

&lt;p&gt;This is how it looks:&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping Corona Virus data with Python and Beautiful Soup</title>
      <dc:creator>TeraCrawler</dc:creator>
      <pubDate>Sun, 04 Oct 2020 16:40:02 +0000</pubDate>
      <link>https://dev.to/crawlertera/scraping-corona-virus-data-with-python-and-beautiful-soup-526a</link>
      <guid>https://dev.to/crawlertera/scraping-corona-virus-data-with-python-and-beautiful-soup-526a</guid>
      <description>&lt;p&gt;Today we are going to see how we can scrape Corona Virus data using Python and BeautifulSoup in a simple manner.&lt;/p&gt;

&lt;p&gt;The aim of this article is to get you started on real-world problem solving while keeping it super simple, so you get familiar and get practical results as fast as possible.&lt;/p&gt;

&lt;p&gt;So the first thing we need is to make sure we have Python 3 installed. If not, you can just get Python 3 and get it installed before you proceed.&lt;/p&gt;

&lt;p&gt;Then you can install beautiful soup with:&lt;/p&gt;

&lt;p&gt;We will also need the libraries requests, lxml, and soupsieve to fetch data, parse the HTML, and use CSS selectors. Install them using:&lt;/p&gt;

&lt;p&gt;Once installed open an editor and type in:&lt;/p&gt;

&lt;p&gt;Now let's go to the Corona Virus data listing page at the ECDC website and inspect the data we can get.&lt;/p&gt;

&lt;p&gt;This is how it looks:&lt;/p&gt;

&lt;p&gt;Back to our code now. Let's try and get this data by pretending we are a browser like this:&lt;/p&gt;

&lt;p&gt;If you run it:&lt;/p&gt;

&lt;p&gt;You will see the whole HTML page.&lt;/p&gt;

&lt;p&gt;Now, let's use CSS selectors to get to the data we want. To do that let's go back to Chrome and open the inspect tool.&lt;/p&gt;

&lt;p&gt;We notice that all the individual country data is contained in a table, with rows containing individual country info and cells containing the specific fields of data.&lt;/p&gt;

&lt;p&gt;So we can extract them like this:&lt;/p&gt;
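&lt;p&gt;The row-and-cell walk can be sketched like this on a tiny made-up table (the figures are placeholders, and square brackets stand in for angle brackets until they are swapped back):&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# stand-in for the ECDC data table
SAMPLE = """
[table]
[tr][td]Continent[/td][td]Country[/td][td]Cases[/td][td]Deaths[/td][/tr]
[tr][td]Europe[/td][td]Italy[/td][td]317409[/td][td]35918[/td][/tr]
[tr][td]Asia[/td][td]India[/td][td]6549373[/td][td]101782[/td][/tr]
[/table]
""".replace("[", chr(60)).replace("]", chr(62))

soup = BeautifulSoup(SAMPLE, "html.parser")
for row in soup.select("tr"):
    cells = [td.get_text(strip=True) for td in row.select("td")]
    print(cells)
```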

&lt;p&gt;If we run it we will get all the data we need.&lt;/p&gt;

&lt;p&gt;We got them all.&lt;/p&gt;

&lt;p&gt;If you want to use this in production and want to scale to thousands of links then you will find that you will get IP blocked easily by several websites. In this scenario using a rotating proxy service to rotate IPs is almost a must. You can use a service like Proxies API to route your calls through a pool of millions of residential proxies.&lt;/p&gt;

&lt;p&gt;If you want to scale the crawling speed and don't want to set up your own infrastructure, you can use our cloud-based crawler crawltohell.com to easily crawl thousands of URLs at high speed from our network of crawlers.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Systematic Web Scraping</title>
      <dc:creator>TeraCrawler</dc:creator>
      <pubDate>Wed, 30 Sep 2020 17:22:18 +0000</pubDate>
      <link>https://dev.to/crawlertera/systematic-web-scraping-2ko4</link>
      <guid>https://dev.to/crawlertera/systematic-web-scraping-2ko4</guid>
      <description>&lt;p&gt;It helps to think of a web crawler as a system rather than a piece of code.&lt;/p&gt;

&lt;p&gt;This shift is very important and will be forced on any developer who attempts web scraping at scale.&lt;/p&gt;

&lt;p&gt;It is one of the best ways to learn to think in systems.&lt;/p&gt;

&lt;p&gt;We can see the whole crawling process as a workflow with multiple possible points of failure. In fact, any place where the scraper depends on external resources is a place it can and will fail. So 90% of a developer's time is spent fixing these inevitable issues bit by bit.&lt;/p&gt;

&lt;p&gt;At Proxies API, we went through the drudgery of not thinking systematically about web scraping until one day we took a step back and identified the central problem. The code was never the problem. The whole thing didn't work as a system. We finally settled on our own set of rules for making crawlers that work systematically.&lt;/p&gt;

&lt;p&gt;Here are the rules that the system has to obey:&lt;/p&gt;

&lt;p&gt;Handle fetching issues (timeouts, redirects, headers, browser spoofing, CAPTCHAs, and IP blocks).&lt;br&gt;
Where the crawler doesn't have a solution to an issue (for example, CAPTCHAs), it should at least handle and log it.&lt;br&gt;
The system should be able to "step over" any issue rather than stumble and bring everything down with it.&lt;br&gt;
The system should immediately alert the developer about an issue.&lt;br&gt;
The system should help the developer diagnose the last issue quickly, with as much context as possible, so it is easily reproducible.&lt;br&gt;
The system should be as generic as possible at the code level and should push individual website logic out to an external database as much as possible.&lt;br&gt;
The system should have enough levers to control the speed and scale of the crawl.&lt;/p&gt;
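&lt;p&gt;The rule about pushing website logic out of the code is the least obvious, so here is a tiny sketch of what it means in practice: the crawler stays generic, and per-site selectors and pacing live in a lookup. A dict stands in for the external database, and the selectors and delays shown are placeholders:&lt;/p&gt;

```python
# per-site logic lives in data, not in the crawler's code
SITE_RULES = {
    "copyblogger.com": {
        "item": "h2.entry-title",
        "link": "a.entry-title-link",
        "delay_seconds": 2,
    },
    "yellowpages.com": {
        "item": "div.v-card",
        "link": "a.business-name",
        "delay_seconds": 5,
    },
}

def rules_for(domain):
    """Generic crawler code asks for selectors instead of hard-coding them."""
    return SITE_RULES.get(domain, {"item": "body", "link": "a", "delay_seconds": 10})

print(rules_for("copyblogger.com")["item"])
```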

</description>
    </item>
    <item>
      <title>Wake Up. Your Web Crawler is Down</title>
      <dc:creator>TeraCrawler</dc:creator>
      <pubDate>Sat, 26 Sep 2020 16:17:49 +0000</pubDate>
      <link>https://dev.to/crawlertera/wake-up-your-web-crawler-is-down-10jg</link>
      <guid>https://dev.to/crawlertera/wake-up-your-web-crawler-is-down-10jg</guid>
      <description>&lt;p&gt;If you have ever written a web crawler, you will find that it is one of the most bafflingly difficult programs to write. And as a beginner, it's almost a guarantee that we will make several mistakes in the process of building one.&lt;/p&gt;

&lt;p&gt;Initially, we think building a crawler is about building the code. The code works on an example website. It's able to crawl, scrape, and store data. What could go wrong?&lt;/p&gt;

&lt;p&gt;Well, it turns out, it's many things.&lt;/p&gt;

&lt;p&gt;I remember when I wrote my first crawler. It was in PHP. It would use cURL requests to download pages, then scrape them using regex, paginate, and store the data in MySQL.&lt;/p&gt;

&lt;p&gt;And once we deployed it, this became the theme of my life.&lt;/p&gt;

&lt;p&gt;Wake up. The Web Crawler is down&lt;br&gt;
And it was always something new.&lt;/p&gt;

&lt;p&gt;The way I had coded it, if the code broke while crawling a page, the whole process broke down. The rest of the crawl would stop, and so would the scraping. I also had no way of knowing how many URLs I had finished crawling, whether they were successfully fetched, or whether they were successfully scraped. I had no way to resume where I left off. I had not heard of Robots.txt. I didn't know I could use asynchronous requests to download URLs concurrently. I had set no rules about not following external links. Once I did manage to write the code for that, it would not fetch the CDN images because their URLs were slightly different. My code was so complex, and I was at my wit's end. So I would hard-code many things particular to a website into the code. There was a separate project for each website I had to scrape.&lt;/p&gt;

&lt;p&gt;I didn't know that there were frameworks where most of the heavy lifting was already done that I could use. The code was working fine in my little setup. But out there in the wild, it failed at the drop of a hat.&lt;/p&gt;

&lt;p&gt;The website was unreachable - My crawler would break.&lt;/p&gt;

&lt;p&gt;The website changed the HTML patterns. My Regex would break.&lt;/p&gt;

&lt;p&gt;The website threw out a CAPTCHA challenge - my crawler would have a meltdown.&lt;/p&gt;

&lt;p&gt;Websites would simply block me all the time, and I would restart my router to get a new IP and connect every time.&lt;/p&gt;

&lt;p&gt;Writing a web crawler is one of the most fun jobs in programming but also one of the most difficult.&lt;/p&gt;

&lt;p&gt;Eventually, all these frustrations led to a lot of learning, and we developed tools like Proxies API and crawltohell to help people overcome as many of these problems as simply and as easily as possible.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Web Crawling an Entire Blog</title>
      <dc:creator>TeraCrawler</dc:creator>
      <pubDate>Tue, 22 Sep 2020 09:09:03 +0000</pubDate>
      <link>https://dev.to/crawlertera/web-crawling-an-entire-blog-43b2</link>
      <guid>https://dev.to/crawlertera/web-crawling-an-entire-blog-43b2</guid>
      <description>&lt;p&gt;One of the most common applications of web crawling, according to the patterns we see with many of our customers at crawltohell, is scraping blog posts. Today let's look at how we can build a simple scraper to pull out and save blog posts from a blog like CopyBlogger.&lt;br&gt;
There are about ten posts on this page. We will try and scrape them all.&lt;/p&gt;

&lt;p&gt;First, we need to install scrapy if you haven't already.&lt;br&gt;
Once installed, we will add a simple file with some barebones code.&lt;br&gt;
Let's examine this code before we proceed.&lt;/p&gt;

&lt;p&gt;The allowed_domains array restricts all further crawling to the domain paths specified here.&lt;/p&gt;

&lt;p&gt;start_urls is the list of URLs to crawl for us; in this example, we only need one URL.&lt;/p&gt;

&lt;p&gt;The LOG_LEVEL settings make the scrapy output less verbose, so it is not confusing.&lt;/p&gt;

&lt;p&gt;The def parse(self, response): function is called by scrapy after every successful URL crawl. Here is where we can write our code to extract the data we want.&lt;/p&gt;

&lt;p&gt;Now let's see what we can write in the parse function.&lt;br&gt;
When we inspect this in the Google Chrome inspect tool (right-click on the page in Chrome and click Inspect to bring it up), we can see that the article headlines are always inside an H2 tag with the CSS class entry-title. This is good enough for us. We can just select this using the CSS selector function.&lt;br&gt;
This will give us the headline. We also need the href in the 'a' tag, which has the class entry-title-link, so we need to extract this as well.&lt;br&gt;
So let's put this all together.&lt;br&gt;
Let's save it as BlogCrawler.py and then run it with these parameters, which tell scrapy to disobey Robots.txt and also to simulate a web browser.&lt;br&gt;
When you run it, it should return the results.&lt;br&gt;
Those are all the blog posts. Let's save them into files.&lt;br&gt;
When you run it now, it will save all the blog posts into a file folder.&lt;/p&gt;
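&lt;p&gt;The selector logic above can be shown framework-free on an inline sample, demonstrated here with BeautifulSoup rather than scrapy for brevity. The entry-title and entry-title-link class names come from the page; the sample markup is invented, with square brackets standing in for angle brackets until the swap:&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# minimal stand-in for a CopyBlogger listing page
SAMPLE = """
[h2 class="entry-title"][a class="entry-title-link"
  href="https://copyblogger.com/first-post/"]First Post[/a][/h2]
[h2 class="entry-title"][a class="entry-title-link"
  href="https://copyblogger.com/second-post/"]Second Post[/a][/h2]
""".replace("[", chr(60)).replace("]", chr(62))

soup = BeautifulSoup(SAMPLE, "html.parser")
for h2 in soup.select("h2.entry-title"):
    link = h2.select_one("a.entry-title-link")
    # headline text plus the href we need for fetching the full post
    print(link.get_text(strip=True), link["href"])
```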

&lt;p&gt;But if you look at it, there are more than 320-odd pages like this on CopyBlogger. We need a way to paginate through them and fetch them all.&lt;br&gt;
When we inspect this in the Google Chrome inspect tool, we can see that the link is inside an LI element with the CSS class pagination-next. This is good enough for us. We can just select this using the CSS selector function.&lt;br&gt;
This will give us the text 'Next Page', though. What we need is the href in the 'a' tag inside the LI tag. So we modify it to this.&lt;br&gt;
The moment we have the URL, we can ask Scrapy to fetch the URL contents.&lt;br&gt;
And when you run it, it should download all the blog posts that were ever written on CopyBlogger onto your system.&lt;br&gt;
Scaling Scrapy&lt;/p&gt;

&lt;p&gt;The example above is OK for small-scale web crawling projects. But if you try to scrape large quantities of data at high speeds, you will find that sooner or later your access will be restricted. Web servers can tell you are a bot, so one of the things you can do is run the crawler impersonating a web browser. This is done by passing a user agent string to the web server, so it doesn't block you.&lt;br&gt;
In more advanced implementations, you will need to even rotate this string, so CopyBlogger can't tell it's the same browser! Welcome to web scraping.&lt;/p&gt;

&lt;p&gt;If you want to use this in production and want to scale to thousands of links, then you will find that you will get IP blocked easily by Copyblogger. In this scenario, using a rotating proxy service to rotate IPs is almost a must. You can use a service like Proxies API to route your calls through a pool of millions of residential proxies.&lt;/p&gt;

&lt;p&gt;If you want to scale the crawling speed and don't want to set up your own infrastructure, you can use our cloud-based crawler crawltohell.com to easily crawl thousands of URLs at high speed from our network of crawlers.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Web Scraping Don'ts</title>
      <dc:creator>TeraCrawler</dc:creator>
      <pubDate>Fri, 18 Sep 2020 10:25:04 +0000</pubDate>
      <link>https://dev.to/crawlertera/web-scraping-don-ts-40gc</link>
      <guid>https://dev.to/crawlertera/web-scraping-don-ts-40gc</guid>
      <description>&lt;p&gt;Web scraping projects are known to fail a lot. So we thought it more appropriate to have a list of DON'Ts rather than a list of DOs. So here goes.&lt;/p&gt;

&lt;p&gt;If the crawler depends on any external data or event happening in a particular way, DON'T assume it will happen like that. It won't, more often than it will. For example, when fetching a URL, it could break because of timeouts, redirects, CAPTCHA challenges, IP blocks, etc.&lt;br&gt;
DON'T build custom code. Use a framework like scrapy.&lt;br&gt;
DON'T be too aggressive on a website. Check the response time of the website first. In fact, at crawltohell.com, our crawlers adjust their concurrency depending on the response time of the domain, so we don't burden their servers too much.&lt;br&gt;
DON'T write linear code. Don't write code that crawls, scrapes data, processes it, and stores it all in one linear process. If one step breaks, so do the others, and you also can't measure and optimize the performance of each process. Batch them instead.&lt;br&gt;
DON'T depend on your own IPs. They will eventually get blocked. Always build in the ability to proxy your requests through a rotating proxy service like Proxies API.&lt;/p&gt;
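&lt;p&gt;The response-time point can be sketched like this. The thresholds and delays are invented, and a real crawler would feed the suggested pause into its scheduler:&lt;/p&gt;

```python
import time

def next_delay(response_seconds):
    """Return how long to wait before the next request to this domain."""
    if response_seconds > 2.0:
        return 10.0     # server is struggling, back way off
    if response_seconds > 0.5:
        return 3.0      # a little sluggish, be gentle
    return 1.0          # healthy, keep a polite baseline

def timed_fetch(fetcher, url):
    """Time a fetch and report both the body and the suggested pause."""
    start = time.monotonic()
    body = fetcher(url)
    elapsed = time.monotonic() - start
    return body, next_delay(elapsed)

# stub fetcher for demonstration; a real one would make an HTTP request
body, pause = timed_fetch(lambda u: "page body", "https://example.com")
print(pause)
```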

</description>
    </item>
  </channel>
</rss>
