Mohan Ganesan

Posted on • Originally published at proxiesapi.com

Scraping Reddit with Python and Beautiful Soup

Today we are going to see how we can scrape Reddit posts using Python and BeautifulSoup in a simple and elegant manner.
The aim of this article is to get you started on solving a real-world problem while keeping it super simple, so you become familiar with the tools and get practical results as fast as possible.

So the first thing we need to do is make sure Python 3 is installed. If not, install Python 3 before you proceed.

Then you can install beautiful soup with...
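For example, with pip (the `requests` library is installed alongside it here because we will use it shortly to download the page):

```shell
pip install beautifulsoup4 requests
```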
Once installed, open an editor and type in the following.
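A minimal start is just the two imports we will rely on, with a quick sanity check that Beautiful Soup parses HTML as expected:

```python
import requests
from bs4 import BeautifulSoup

# Quick sanity check that Beautiful Soup is working.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.text)  # hello
```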
Now let's go to the programming subreddit and inspect the data we can get.

Back to our code now. Let's try to get this data by pretending to be a browser, like this.
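A sketch of that fetch. The User-Agent string below is just an example of a recent desktop Chrome identifier; any valid browser string works:

```python
import requests

# Pretend to be a regular desktop browser so Reddit serves the normal HTML
# page instead of blocking the default python-requests User-Agent.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36"
}

url = "https://www.reddit.com/r/programming/"
response = requests.get(url, headers=headers)
print(response.text)  # the raw HTML of the subreddit page
```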
Save this as reddit_bs.py.

If you run it, you will see the whole HTML page.

Now, let's use CSS selectors to get to the data we want. To do that, let's go back to Chrome and open the inspect tool. You can see that all the post title elements share a common class name (the exact class names change whenever Reddit tweaks its markup, so check the inspect tool for the current ones).

Let's use CSS selectors to get this data like so.
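Here is a sketch using a trimmed-down stand-in for the page HTML. The class names `Post` and `post-title` are placeholders for illustration; in your own script you would parse `response.text` and use whatever class names the inspect tool shows:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for: soup = BeautifulSoup(response.text, "html.parser")
html = """
<div class="Post"><h3 class="post-title">First post</h3></div>
<div class="Post"><h3 class="post-title">Second post</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first element that matches the CSS selector.
print(soup.select_one("h3.post-title").text)  # First post
```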
This will print the title of the first post. We now need to get to all the posts. We notice that the class 'Post' (amongst others) holds all of an individual post's data together.
To get to them individually, we run through them like this and try to get to the post title from 'inside' the 'Post':
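A sketch of that loop, again with placeholder class names standing in for whatever the inspect tool shows:

```python
from bs4 import BeautifulSoup

html = """
<div class="Post"><h3 class="post-title">First post</h3></div>
<div class="Post"><h3 class="post-title">Second post</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

titles = []
for post in soup.select("div.Post"):          # every post container
    title = post.select_one("h3.post-title")  # the title *inside* this post
    if title:
        titles.append(title.text)
        print(title.text)
```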
Bingo! We got the post titles.

Now, with the same process, we get the class names of all the other data, like post votes, the number of comments, a link to the post, etc.
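Putting it all together might look like this. Every class name below (`post-votes`, `comment-count`, `post-link`) is a hypothetical stand-in; substitute the real ones from your inspect tool:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real subreddit page.
html = """
<div class="Post">
  <h3 class="post-title">First post</h3>
  <span class="post-votes">120</span>
  <span class="comment-count">34 comments</span>
  <a class="post-link" href="https://example.com/first">permalink</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

posts = []
for post in soup.select("div.Post"):
    posts.append({
        "title": post.select_one("h3.post-title").text,
        "votes": post.select_one("span.post-votes").text,
        "comments": post.select_one("span.comment-count").text,
        "link": post.select_one("a.post-link")["href"],
    })
    print(posts[-1])
```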
That, when run, should print everything we need from each post, like this.

If you want to use this in production and want to scale to thousands of links, you will find that Reddit blocks your IP quickly. In this scenario, using a rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot-detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high-speed rotating proxies located all over the world, automatic IP rotation, automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions), and automatic CAPTCHA-solving technology, hundreds of our customers have solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.
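A sketch of building such a request URL. The endpoint and parameter names (`auth_key`, `url`) here are assumptions for illustration; check the official Proxies API documentation for the real ones:

```python
import urllib.parse

def proxies_api_url(auth_key, target_url):
    # Build a request URL that asks the proxy service to fetch target_url.
    # Endpoint and parameter names are assumptions -- verify against the docs.
    query = urllib.parse.urlencode({"auth_key": auth_key, "url": target_url})
    return "http://api.proxiesapi.com/?" + query

print(proxies_api_url("YOUR_KEY", "https://www.reddit.com/r/programming/"))
```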

Top comments (1)

Kyle Jones • Edited

One consideration when doing this is duplication. Reddit has a pretty big cross-posting culture and you'd likely not want similar/identical posts being caught by your scraper.
Some decent ways around this would be to store hashes of post titles or by using something like a MinHash or SimHash.
Implementing something like this can significantly reduce the storage used.
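A minimal sketch of the exact-duplicate variant of this idea, using only the standard library (MinHash or SimHash, e.g. via a library like `datasketch`, would additionally catch near-duplicate titles):

```python
import hashlib

seen = set()

def is_new(title):
    """Return True the first time a (normalized) title is seen."""
    digest = hashlib.sha256(title.strip().lower().encode("utf-8")).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

print(is_new("Cool new language feature"))   # True
print(is_new("Cool New Language Feature"))   # False -- cross-post duplicate
```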