Mohan Ganesan

Posted on Sep 30, 2020 • Edited on Jan 13, 2021 • Originally published at proxiesapi.com

Scraping an Entire Blog with Scrapy

Scrapy is one of the easiest tools you can use to scrape and spider a website.

One of the most common applications of web scraping, according to the patterns we see with many of our customers at Proxies API, is scraping blog posts. Today, let's look at how we can build a simple scraper to pull out and save blog posts from a blog like CopyBlogger.

Here is how the CopyBlogger blog section looks.

You can see that there are about 10 posts on this page. We will try and scrape them all.

First, install Scrapy if you haven't already (for example, with pip install scrapy).

Once installed, we will add a simple file with some barebones code.

Let's examine this code before we proceed...

The allowed_domains list restricts all further crawling to the domains specified here.

start_urls is the list of URLs to crawl... for us, in this example, we only need one URL.

The LOG_LEVEL setting makes the Scrapy output less verbose so it is easier to read.

The def parse(self, response): function is called by scrapy after every successful URL crawl. Here is where we can write our code to extract the data we want.

Now let's see what we can write in the parse function...

For this let's find the CSS patterns that we can use as selectors for finding the blog posts on this page.

When we inspect this in the Google Chrome inspect tool (right-click on the page in Chrome and click Inspect to bring it up), we can see that the article headlines are always inside an H2 tag with the CSS class entry-title. This is good enough for us. We can just select this using the CSS selector function.

This will give us the Headline. We also need the href in the 'a' which has the class entry-title-link so we need to extract this as well.

So let's put this all together.

Let's save it as BlogCrawler.py and then run it with these parameters, which tell Scrapy to ignore robots.txt and to simulate a web browser.

When you run it, it should return the headline and link of each post on the page.

Those are all the blog posts. Let's save them into files.
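A small sketch of the saving step, using only the standard library. The function names and the storage folder are hypothetical; the idea is to derive a file name from each post URL and write the raw page bytes under it.

```python
from pathlib import Path
from urllib.parse import urlparse


def filename_for(url):
    """Turn a post URL into a file name, e.g. .../first-post/ -> first-post.html."""
    slug = urlparse(url).path.strip("/").replace("/", "_")
    return (slug or "index") + ".html"


def save_page(url, body, folder="storage"):
    """Write the raw page bytes into the folder and return the file path."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename_for(url)
    target.write_bytes(body)
    return target
```

In the spider, parse could follow each post link with response.follow(link, callback=self.parse_article), and a parse_article callback could then call save_page(response.url, response.body).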

When you run it now, it will save all the blog posts into a folder.

But if you look closely, there are more than 320 pages like this on CopyBlogger. We need a way to paginate through them and fetch them all.

When we inspect this in the Google Chrome inspect tool, we can see that the link is inside an LI element with the CSS class pagination-next. This is good enough for us. We can just select this using the CSS selector function.

This will give us the text 'Next Page' though. What we need is the href in the 'a' tag inside the LI tag.

In fact, the moment we have the URL, we can ask Scrapy to fetch the URL contents.

When you run it, it should download all the blog posts that were ever written on CopyBlogger onto your system.

Scaling Scrapy

The example above is fine for small-scale web crawling projects. But if you try to scrape large quantities of data at high speed, you will find that sooner or later your access will be restricted. Web servers can tell you are a bot, so one of the things you can do is run the crawler impersonating a web browser. This is done by passing a browser's User-Agent string to the web server so it doesn't block you.

In more advanced implementations, you will even need to rotate this string so the site can't tell it's the same browser! Welcome to web scraping.
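One possible way to rotate the string is a small downloader middleware, sketched below. The class name, the helper, and the example User-Agent strings are all assumptions for illustration.

```python
import random

# Example User-Agent strings (assumptions; substitute current browser strings).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0",
]


def pick_user_agent(rng=random):
    """Choose one of the configured User-Agent strings at random."""
    return rng.choice(USER_AGENTS)


class RotateUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = pick_user_agent()
        return None  # let Scrapy continue handling the request
```

The middleware would be enabled through the DOWNLOADER_MIDDLEWARES setting, for example {"myproject.middlewares.RotateUserAgentMiddleware": 400}, where the module path is hypothetical.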

If we get a little more advanced, you will realize that the site can simply block your IP, ignoring all your other tricks. This is a bummer, and this is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with 1000 free API calls on offer, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high-speed rotating proxies located all over the world,
With our automatic IP rotation,
With our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions),
With our automatic CAPTCHA-solving technology,
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.
