Navon Francis

Web scraping, efficiently!

Web scraping is a useful (and super cool) way to access data you may need for your application. Sometimes you may want something really specific that is not provided in an API or database. Today, we're going to tweak the usual approach and make it even cooler by using some asynchronous libraries in Python to make it faster!


We will be using the following:

  • Chrome dev tools to inspect html elements
  • Python 3 just cuz
  • BeautifulSoup 4 for scraping
  • grequests for asynchronous requests

psst... you may need some other libraries. Since I am using Python 3, I used pip3 to install anything miscellaneous that I needed, such as lxml.

pip3 install BeautifulSoup4 grequests lxml

So let's go over a couple of things first...

The Motivation

So you're browsing through dev.to, your favorite website, and you are thinking:

"You know what would be so cool, is figuring out the header information of an article."

"But I don't just read one, I want multiple articles because I love to read many articles at one time."

So: tags, article name, user, and date. I know this guy, so he's ok with me advertising his article like this ;)

and then you think,

"Okay! I think i've made my life hard enough now"

The Problem

To scrape something, you basically make a request to a URL; the response contains all the HTML for that page. With that, you can use a parsing library like BeautifulSoup 4 to parse through and find the element you want (such as an h1 tag), then extract the data that you need (the title inside that h1 tag).

BUT, we're going to take it one step further and make this even cooler. We will use asynchronous requests to be able to do this much faster.

A good analogy for an async process: imagine you are late for work, so you pop a waffle in the toaster (start request 1). After you press start, you go ahead and start brushing your teeth (start request 2) while waiting for your waffle to finish. Boom, async waffles.


Let's get started!

Imports

from bs4 import BeautifulSoup

import grequests
import requests
import time

These are the libraries you will need to import. BeautifulSoup is for parsing, grequests is the library that will let us make async requests, requests is just for example purposes - we will use it to demonstrate the slower version (not needed), and time is for measuring purposes (not needed). BTW, if you're putting in all of these imports, make sure grequests comes before requests, like above - grequests monkey-patches the underlying networking via gevent, and importing it after requests can give you a really nasty error that might take you 45 minutes to figure out.
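For reference, the "Took X seconds" numbers you'll see later in this post come from a simple timer wrapped around the scraping loop; here's a minimal sketch of that pattern (the start variable is just my own naming):

start = time.time()

# ... make your requests and parse them here ...

print("Took", time.time() - start, "seconds")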

Scraping

links = ['https://dev.to/davidisrawi/build-a-quick-summarizer-with-python-and-nltk',
    'https://dev.to/kbk0125/web-servers-explained-by-running-a-microbrewery-48ie',
    'https://dev.to/dan_abramov/react-beginner-question-thread--1i5e',
    'https://dev.to/ben/im-ben-and-i-am-a-rails-developer-1j67',
    'https://dev.to/devteam/changelog-mentor-matchmaking-3bl0'
]

So I chose 5 articles that are pretty cool. These are the URLs that we will be extracting the HTML from by making a request to each of them. Let's build a bone-dry scraper with no async requests first.

We don't want to make too many requests to dev.to's servers, let's be respectful :)

So, we will need to loop through the links list and make a simple request.

for link in links:
    req = requests.get(link)

After we make our request, we want to create a BeautifulSoup object. This will allow us to call really useful functions like .find() to easily extract what we want.

Calling .text on a response will dump all the HTML for that page - try it out!
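For example, here's a quick way to peek at that HTML without flooding your terminal (the 500-character slice is arbitrary):

req = requests.get(links[0])
print(req.text[:500])  # the first 500 characters of the page's raw HTML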

soup = BeautifulSoup(req.text, 'lxml')

Now, with our soup object, we can call .find() to retrieve the title of the current page. In our .find() you can see we are specifying an h1 tag with a class of medium (the argument is spelled class_ with a trailing underscore because class is a reserved word in Python), and calling .string on the result. What does all this mean?

# article
print(soup.find('h1', class_='medium').string.lstrip().rstrip())

To pinpoint a specific element (in this case a title), just inspect the page, use the option that lets you select an HTML element by clicking on it, choose the element, and dev tools will highlight where in the HTML that element is located.

This allows us to grab the attributes of that element to specify in our .find(). The .string lets us get everything between the tags, like this:

<h1>Calling.stringGrabsThis</h1> 

lstrip() and rstrip() remove leading and trailing whitespace, respectively.
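A tiny example (the string here is made up):

title = "   Build a quick Summarizer   "
print(title.lstrip().rstrip())  # "Build a quick Summarizer"
# .strip() does both sides in one call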


The process is the same with name, date, and tags.

for link in links:
    req = requests.get(link)
    soup = BeautifulSoup(req.text, 'lxml')

    # article
    print(soup.find('h1', class_='medium').string.lstrip().rstrip())

    # name
    print(soup.find('span', itemprop="name").string)

    # date
    print(soup.find('span', class_="published-at").string)

    # tags
    tags = list(map(lambda x: x.string, soup.find_all('a', class_='tag')))
    print(tags, "\n")

We did do something a little fancier to retrieve the tags, though. Since there are multiple tags, we call .find_all(), because each article tag sits in its own anchor element (an a tag with the class tag). Using map with a lambda function, we pull the string out of each tag, just like above. Then we just throw them all in a list.
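If the map/lambda combo feels dense, the same result can be written as a list comprehension, which some find easier to read:

tags = [a.string for a in soup.find_all('a', class_='tag')]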

This is what we get:

Build a quick Summarizer with Python and NLTK
David Israwi
Aug 17 '17
['#python', '#nlp', '#dataanalytics', '#learning'] 

Web Servers Explained by Running a Microbrewery
Kevin Kononenko
May 17
['#tutorial', '#beginners', '#webdev'] 

React Beginner Question Thread ⚛
Dan Abramov
Dec 24 '17
['#react', '#javascript', '#beginners', '#webdev'] 

I’m Ben and I am a Rails developer
Ben Halpern
Apr 17
['#ruby', '#rails', '#productivity', '#career'] 

Changelog: Mentor Matchmaking!
Jess Lee
Jul  6
['#changelog', '#meta', '#mentorship'] 

Took 1.877747106552124 seconds

Now that we understand how scraping works the traditional way, we can adapt this to async requests. Instead of iterating through the links list and making the requests inside our loop, we will create a list of responses that houses the results of all our requests up front. Let's see how this works.

Using the grequests library we can create a list of unsent requests.

reqs = [grequests.get(link) for link in links]

We can then use grequests.map() to fire off these requests and store the results in a list of responses. We use map because it sends the requests concurrently - it doesn't wait for the current request to finish before starting the next one.

resp = grequests.map(reqs)

Now, instead of looping through the links, we already have our responses in hand, so we will iterate through those instead.

for r in resp:
    # we've already made our request, now just parse!

So our final async implementation should look like the following. All the scraping logic stays the same!

reqs = [grequests.get(link) for link in links]
resp = grequests.map(reqs)

for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')

    # article
    print(soup.find('h1', class_='medium').string.lstrip().rstrip())

    # name
    print(soup.find('span', itemprop="name").string)

    # date
    print(soup.find('span', class_="published-at").string)

    # tags
    tags = list(map(lambda x: x.string, soup.find_all('a', class_='tag')))
    print(tags, "\n")

Final output:

Build a quick Summarizer with Python and NLTK
David Israwi
Aug 17 '17
['#python', '#nlp', '#dataanalytics', '#learning'] 

Web Servers Explained by Running a Microbrewery
Kevin Kononenko
May 17
['#tutorial', '#beginners', '#webdev'] 

React Beginner Question Thread ⚛
Dan Abramov
Dec 24 '17
['#react', '#javascript', '#beginners', '#webdev'] 

I’m Ben and I am a Rails developer
Ben Halpern
Apr 17
['#ruby', '#rails', '#productivity', '#career'] 

Changelog: Mentor Matchmaking!
Jess Lee
Jul  6
['#changelog', '#meta', '#mentorship'] 

Took 0.8569440841674805 seconds

This is a huge time saver. The first version averaged about ~1.5 - ~1.8 seconds and the second averaged ~0.8 - ~1.0 seconds. We can safely say we are saving about ~0.1 - ~0.2 seconds per request. This may not seem huge at first, but that's just 5 requests! When making many async calls, you will be saving a ton of time. Test this out on your own; each application may vary. You could fire off a ton of requests, 20 at a time or 5 at a time - experiment with it to find out what best suits your application's needs!
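If you do go down the road of firing off a lot of requests at once, grequests.map() accepts a size argument that caps how many requests run concurrently; a quick sketch (the value 5 here is arbitrary):

resp = grequests.map(reqs, size=5)  # at most 5 requests in flight at a time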

If you want to save even more time, look into SoupStrainer for BeautifulSoup, an object you can pass to the BeautifulSoup constructor (via its parse_only argument) so that only the parts of the page you care about get parsed. It may not yield gains like the ones above, but it's a great tool to help keep your code efficient! Maybe that's an article for another day.
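For the curious, here's a rough sketch of what that looks like - the strainer below only keeps the h1 title tags, so everything else on the page is skipped during parsing:

from bs4 import SoupStrainer

only_titles = SoupStrainer('h1', class_='medium')
soup = BeautifulSoup(r.text, 'lxml', parse_only=only_titles)
print(soup.find('h1', class_='medium').string.strip())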

I hope you enjoyed this little guide to scraping! Feel free to drop a comment or reach out to me for any problems or questions you may have. I hope this helps!

peace out

Top comments (6)

kubistmi

Hello Navon,
first of all, thank you for a very nice and concise article on Python web scraping. It is all that beginners can ask for when first encountering this topic.

Secondly, it seems to me that there is a minor typo in your code (repeated a few times). In the 'Scraping' part, you first use the variable req to keep the HTML content but then use the variable r for HTML parsing, as follows:

for link in links:
    req = requests.get(link)                  # variable req
    soup = BeautifulSoup(r.text, 'lxml')      # variable r

Cheers,
Michal

Navon Francis

Thank you so much for the awesome feedback Michal! Also, I really appreciate your help in finding that typo :) I will update accordingly

Sam Mathew

Really helpful. And it's always better to rotate the user agent and proxy. You can use proxies like Bright Data
brightdata.grsm.io/createaccount

(Mark) Boyang Zhao

You won't have enough memory when you have to make thousands or millions of requests. You can use hooks to solve that.
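For readers wondering what that might look like: requests supports response hooks (callbacks that run as each response arrives), and grequests passes keyword arguments like hooks through to requests. A rough sketch, with handle_response being a made-up name, that parses each page as it comes in instead of keeping every response around:

def handle_response(response, *args, **kwargs):
    # parse and print right away, so the full response doesn't need to be kept
    soup = BeautifulSoup(response.text, 'lxml')
    print(soup.find('h1', class_='medium').string.strip())

reqs = (grequests.get(link, hooks={'response': handle_response}) for link in links)
grequests.map(reqs, size=5)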

Mehdi Medrani

Well written article!
Thank you so much!!
