TeraCrawler

Web Crawling an Entire Blog

One of the most common applications of web crawling, judging by the patterns we see with many of our customers at crawltohell.com, is scraping blog posts. Today, let's look at how we can build a simple scraper to pull out and save blog posts from a blog like Copyblogger.
There are about ten posts on this page. We will try to scrape them all.

First, install Scrapy if you haven't already.
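With pip, that is:

```
pip install scrapy
```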
Once installed, we will add a simple file with some barebones code.
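A barebones spider looks roughly like this (the spider name, file name, and start URL here are just placeholders for this example):

```python
import scrapy


class BlogCrawler(scrapy.Spider):
    name = 'blogcrawler'

    # Only URLs on this domain will be crawled
    allowed_domains = ['copyblogger.com']

    # The single URL we start crawling from (the blog listing page)
    start_urls = ['https://copyblogger.com/blog/']

    # Keep Scrapy's logging quiet so the output is easy to read
    custom_settings = {
        'LOG_LEVEL': 'INFO',
    }

    def parse(self, response):
        # Called for every successfully fetched URL; extraction code goes here
        pass
```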
Let's examine this code before we proceed.

The allowed_domains list restricts all further crawling to the domains specified here.

start_urls is the list of URLs to crawl for us; in this example, we only need one URL.

The LOG_LEVEL setting makes the Scrapy output less verbose so it is easier to follow.

The parse(self, response) method is called by Scrapy after every successful URL fetch. This is where we write the code to extract the data we want.

Now let's see what we can write in the parse function.
When we inspect this in the Google Chrome inspect tool (right-click on the page in Chrome and click Inspect to bring it up), we can see that the article headlines are always inside an H2 tag with the CSS class entry-title. This is good enough for us. We can just select this using the CSS selector function.
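Inside the parse method, that selection looks something like this:

```python
    def parse(self, response):
        # Every article headline sits inside an <h2 class="entry-title">
        headlines = response.css('h2.entry-title').extract()
        print(headlines)
```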
This will give us the headline. We also need the href of the 'a' tag that has the class entry-title-link, so we need to extract that as well.
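Pulling the href attribute off that anchor looks something like this:

```python
    def parse(self, response):
        # The post URL is the href on the <a class="entry-title-link"> anchor
        links = response.css('a.entry-title-link::attr(href)').extract()
        print(links)
```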
So let's put this all together.
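A rough assembled version looks like this (the exact selectors depend on Copyblogger's current markup):

```python
import scrapy


class BlogCrawler(scrapy.Spider):
    name = 'blogcrawler'
    allowed_domains = ['copyblogger.com']
    start_urls = ['https://copyblogger.com/blog/']

    custom_settings = {
        'LOG_LEVEL': 'INFO',
    }

    def parse(self, response):
        # Walk every headline and print its text alongside its link
        for heading in response.css('h2.entry-title'):
            title = heading.css('a.entry-title-link::text').extract_first()
            link = heading.css('a.entry-title-link::attr(href)').extract_first()
            print(title, link)
```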
Let's save it as BlogCrawler.py and then run it with parameters that tell Scrapy to ignore robots.txt and to simulate a web browser.
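Something along these lines (the user agent string is just an example of a desktop browser string):

```
scrapy runspider BlogCrawler.py -s ROBOTSTXT_OBEY=False -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
```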
When you run it, it should print all the headlines and their links.
Those are all the blog posts. Let's save them into files.
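One way is to follow each post link and write the raw HTML to disk, roughly like this (the posts/ folder name and the parse_post helper are arbitrary):

```python
import os

import scrapy


class BlogCrawler(scrapy.Spider):
    name = 'blogcrawler'
    allowed_domains = ['copyblogger.com']
    start_urls = ['https://copyblogger.com/blog/']
    custom_settings = {'LOG_LEVEL': 'INFO'}

    def parse(self, response):
        # Follow every post link found on the listing page
        for link in response.css('a.entry-title-link::attr(href)').extract():
            yield scrapy.Request(link, callback=self.parse_post)

    def parse_post(self, response):
        # Name the file after the last segment of the post URL and save the HTML
        os.makedirs('posts', exist_ok=True)
        slug = response.url.rstrip('/').split('/')[-1]
        filename = os.path.join('posts', slug + '.html')
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved %s' % filename)
```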
When you run it now, it will save all the blog posts into a folder of files.

But if you look closely, there are more than 320 pages like this on Copyblogger. We need a way to paginate through them and fetch them all.
When we inspect this in the Google Chrome inspect tool, we can see that the link is inside an LI element with the CSS class pagination-next. This is good enough for us. We can just select this using the CSS selector function.
This will give us the text 'Next Page', though. What we need is the href in the 'a' tag inside the LI tag, so we modify the selector to pull that instead, as shown below.
The moment we have the URL, we can ask Scrapy to fetch the URL contents.
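For example, at the end of the parse method:

```python
        # Find the "Next Page" link and queue it up with the same callback
        next_page = response.css('li.pagination-next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(next_page, callback=self.parse)
```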
And when you run it, it should download every blog post ever written on Copyblogger onto your system.
Scaling Scrapy

The example above is fine for small-scale web crawling projects. But if you try to scrape large quantities of data at high speed, you will find that sooner or later your access gets restricted. Web servers can tell you are a bot, so one of the things you can do is run the crawler impersonating a web browser. This is done by passing a browser user agent string to Copyblogger's web server so it doesn't block you.
In more advanced implementations, you may even need to rotate this string so Copyblogger can't tell it's the same browser making the requests. Welcome to web scraping.
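One common approach is a small downloader middleware that picks a random user agent for every request (the strings below are just examples, and the middleware needs to be enabled in your DOWNLOADER_MIDDLEWARES setting):

```python
import random

# A small pool of example desktop browser user agent strings to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]


class RotatingUserAgentMiddleware:
    """Downloader middleware that stamps a random user agent on every request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None
```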

If you want to use this in production and scale to thousands of links, you will quickly find your IP blocked by Copyblogger. In this scenario, using a rotating proxy service to rotate IPs is almost a must. You can use a service like Proxies API to route your calls through a pool of millions of residential proxies.
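In Scrapy, routing a request through a proxy usually comes down to setting the proxy meta key on the request; the endpoint below is just a placeholder:

```python
        # Route this request through a (placeholder) rotating proxy endpoint
        yield scrapy.Request(
            link,
            callback=self.parse_post,
            meta={'proxy': 'http://username:password@proxy.example.com:8080'},
        )
```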

If you want to scale your crawling speed and don't want to set up your own infrastructure, you can use our cloud-based crawler, crawltohell.com, to easily crawl thousands of URLs at high speed from our network of crawlers.

Top comments (1)

Crawlbase

A concise and informative guide to web crawling for blog posts! This breakdown simplifies the process, making it accessible for beginners. For those looking to scale their web crawling endeavors, tools like Crawlbase offer efficient solutions.