Scraping Washington Post with Python and Beautiful Soup

Today we are going to see how to scrape Washington Post articles using Python and Beautiful Soup in a simple and elegant manner.

The aim of this article is to get you started on solving a real-world problem while keeping things super simple, so you get familiar with the tools and see practical results as fast as possible.

The first thing we need is to make sure we have Python 3 installed. If you don't have it yet, install Python 3 before you proceed.

Then you can install Beautiful Soup with...
We will also need the libraries requests, lxml, and soupsieve to fetch the page, parse the HTML, and use CSS selectors. Install them using...
Once they are installed, open an editor and type in...
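
As a minimal sketch of this setup, the pip commands are shown as comments (run them in a terminal), followed by the imports the script starts with. The package names are the ones published on PyPI; the small parsing check at the end is an illustrative addition and uses the standard-library html.parser so it works even before lxml is installed.

```python
# Install the dependencies first (run these in a terminal, not in Python):
#   pip3 install beautifulsoup4
#   pip3 install requests lxml soupsieve

from bs4 import BeautifulSoup
import requests

# Illustrative sanity check: parse a tiny fragment and query it with a
# CSS selector to confirm that Beautiful Soup and soupsieve are working.
soup = BeautifulSoup("<h1 class='title'>Hello</h1>", "html.parser")
print(soup.select_one("h1.title").get_text())
print("requests version:", requests.__version__)
```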

Now let's go to the Washington Post home page and inspect the data we can get.

Back to our code now. Let's try and get this data by pretending we are a browser like this...
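
Here is a minimal sketch of fetching the home page while sending a browser-like User-Agent header. The exact User-Agent string, the timeout, and the helper name fetch_soup are assumptions made for illustration; any current desktop browser string should behave similarly.

```python
import requests
from bs4 import BeautifulSoup

# Assumed User-Agent string; copy the one your own browser sends if you prefer.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}
URL = "https://www.washingtonpost.com/"


def fetch_soup(url=URL, headers=HEADERS, parser="lxml", timeout=10):
    """Download a page with browser-like headers and parse it (hypothetical helper)."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return BeautifulSoup(response.content, parser)
```

Calling fetch_soup() returns a parsed document that you can print or query. Keeping the request inside a function means nothing is downloaded until you ask for it.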

If you run it...

Now let's use CSS selectors to get to the data we want. To do that, let's go back to Chrome and open the inspect tool. We need to get to all the articles, and we notice that the elements matching the selector '.pb-layout-item.pb-f-homepage-story-ans' hold the individual articles.
Notice that the article title is contained in an element inside the assetWrapper class. We can get to it like this...
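
A minimal sketch of that loop follows. The article selector comes from the text above. The title selector '.headline' is a placeholder assumption, since the exact element depends on what the inspector shows and the site's markup changes over time. The sample HTML is invented so the example runs without a network connection.

```python
from bs4 import BeautifulSoup

ARTICLE_SELECTOR = ".pb-layout-item.pb-f-homepage-story-ans"  # from the article
TITLE_SELECTOR = ".headline"  # assumption: replace with what the inspector shows


def extract_titles(html, parser="lxml"):
    """Return the title text of every article block that has a title element."""
    soup = BeautifulSoup(html, parser)
    titles = []
    for item in soup.select(ARTICLE_SELECTOR):
        title = item.select_one(TITLE_SELECTOR)
        if title is not None:  # skip blocks without a title element
            titles.append(title.get_text(strip=True))
    return titles


# Invented sample markup that mimics the structure described above.
SAMPLE_HTML = """
<div class="pb-layout-item pb-f-homepage-story-ans">
  <h2 class="headline"><a href="/a">First story</a></h2>
</div>
<div class="pb-layout-item pb-f-homepage-story-ans">
  <p>No headline in this block</p>
</div>
<div class="pb-layout-item pb-f-homepage-story-ans">
  <h2 class="headline"><a href="/b">Second story</a></h2>
</div>
"""

sample_titles = extract_titles(SAMPLE_HTML, parser="html.parser")
for title in sample_titles:
    print(title)
```

Checking for None instead of wrapping the loop body in a broad try/except keeps genuine errors visible while still skipping blocks that lack a title.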

This selects all the pb-layout-item article blocks and runs through them, looking for the title element in each one and printing its text.
So when you run it, you get...

Bingo! We got the article titles.

Now, with the same process, we can find the class names for all the other data, like the article link and the article summary.
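
As a sketch of how that might look, the function below collects title, link, and summary for each article block. Only the article selector comes from the text; '.headline', '.headline a', and '.blurb' are hypothetical placeholders to be replaced with the class names you find in the inspector, and the sample HTML is invented for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.washingtonpost.com/"
ARTICLE_SELECTOR = ".pb-layout-item.pb-f-homepage-story-ans"  # from the article
TITLE_SELECTOR = ".headline"    # assumption: placeholder selector
LINK_SELECTOR = ".headline a"   # assumption: placeholder selector
SUMMARY_SELECTOR = ".blurb"     # assumption: placeholder selector


def text_or_none(node):
    """Return stripped text for a tag, or None when the tag was not found."""
    return node.get_text(strip=True) if node is not None else None


def extract_articles(html, parser="lxml"):
    """Return one dict per article block with title, link, and summary."""
    soup = BeautifulSoup(html, parser)
    articles = []
    for item in soup.select(ARTICLE_SELECTOR):
        link = item.select_one(LINK_SELECTOR)
        href = link.get("href") if link is not None else None
        articles.append({
            "title": text_or_none(item.select_one(TITLE_SELECTOR)),
            # Resolve relative links against the site root.
            "link": urljoin(BASE_URL, href) if href else None,
            "summary": text_or_none(item.select_one(SUMMARY_SELECTOR)),
        })
    return articles


# Invented sample markup for a single article block.
SAMPLE_HTML = """
<div class="pb-layout-item pb-f-homepage-story-ans">
  <h2 class="headline"><a href="/politics/example-story/">Example story</a></h2>
  <div class="blurb">A one-line summary.</div>
</div>
"""

sample_articles = extract_articles(SAMPLE_HTML, parser="html.parser")
print(sample_articles)
```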
That was fun.
If you want to use this in production and scale to thousands of links, you will find that the Washington Post blocks your IP quickly. In this scenario, using a rotating proxy service to rotate IPs is almost a must.
Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.
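
To illustrate the general idea of IP rotation, here is a rough sketch that cycles through a list of proxies using the proxies parameter of requests. This is a generic technique, not the API of any particular proxy service. The proxy addresses, the helper names, and the retry count are all hypothetical.

```python
import itertools

import requests

# Hypothetical proxy addresses; substitute proxies you actually control or rent.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

_proxy_cycle = itertools.cycle(PROXY_POOL)


def next_proxies():
    """Return a requests-style proxies dict for the next proxy in the pool."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}


def fetch_via_proxy(url, headers=None, retries=3, timeout=10):
    """Try the request through successive proxies until one succeeds."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        try:
            response = requests.get(
                url, headers=headers, proxies=next_proxies(), timeout=timeout
            )
            response.raise_for_status()
            return response
        except requests.RequestException as error:
            last_error = error  # remember the failure and try the next proxy
    raise last_error
```

Each call to fetch_via_proxy moves on to the next proxy in the pool, so consecutive requests leave from different addresses.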
Our rotating proxy server, Proxies API, provides a simple API designed to take care of IP blocking problems.

With millions of high-speed rotating proxies located all over the world,
with our automatic IP rotation,
with our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions),
and with our automatic CAPTCHA-solving technology,
hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.
