Web scraping, efficiently!

Navon Francis — Tue, 10 Jul 2018 12:56:23 +0000

Web scraping is a useful (and super cool) way to access data you may need for your application. Sometimes you want something really specific that is not provided by an API or database. Today, we're going to tweak the usual approach and make it even cooler by using an asynchronous library in Python to make it faster!

We will be using the following:

Chrome dev tools to inspect html elements
Python 3 just cuz
BeautifulSoup 4 for scraping
grequests for asynchronous requests

psst... you may need some other libraries. Since I am using Python 3, I used pip3 to install anything miscellaneous that I needed, such as lxml.

pip3 install BeautifulSoup4 grequests lxml

So let's go over a couple of things first.

The Motivation

So you're browsing through dev.to, your favorite website, and you are thinking:

"You know what would be so cool, is figuring out the header information of an article."

"But I don't just read one, I want multiple articles because I love to read many articles at one time."

So that's tags, article name, user, and date. I know this guy so he's ok with me advertising his article like this ;)

Build a quick Summarizer with Python and NLTK

David Israwi ・ Aug 17 '17

#python #nlp #dataanalytics #learning

and then you think,

"Okay! I think I've made my life hard enough now"

The Problem

To scrape something, you basically make a request to a URL. The response contains all the HTML for that page. With that in hand, you can use a parsing library like BeautifulSoup 4 to parse through it, find the element you want (such as an h1 tag), and extract the data that you need (a title in that h1 tag).

BUT, we're going to take it one step further and make this even cooler. We will use asynchronous requests to be able to do this much faster.

A good analogy for an async process: imagine you are late for work, so you pop a waffle in the toaster (start request 1). After you press start, you go ahead and start brushing your teeth (start request 2) whilst waiting for your waffle to finish. Boom, async waffles.
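To make the waffle analogy concrete, here is a minimal sketch that uses only the standard library. It is not how grequests works under the hood (grequests builds on gevent); a thread pool and time.sleep() simply stand in for "waiting on something slow", and make_waffle and brush_teeth are made-up names for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_waffle():
    time.sleep(0.2)  # pretend wait: the toaster doing its thing
    return "waffle"


def brush_teeth():
    time.sleep(0.2)  # pretend wait: brushing, scaled way down
    return "clean teeth"


def one_at_a_time():
    """Finish the waffle completely before starting to brush."""
    start = time.perf_counter()
    results = [make_waffle(), brush_teeth()]
    return results, time.perf_counter() - start


def both_at_once():
    """Start both tasks, then collect the results."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        waffle = pool.submit(make_waffle)  # start task 1
        teeth = pool.submit(brush_teeth)   # start task 2 without waiting
        results = [waffle.result(), teeth.result()]
    return results, time.perf_counter() - start
```

Calling one_at_a_time() pays for both waits back to back, while both_at_once() overlaps them, so the second should take roughly as long as the slower of the two tasks.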

Let's get started!

Imports

from bs4 import BeautifulSoup

import grequests
import requests
import time

These are the libraries you will need to import. BeautifulSoup is for parsing, and grequests is the library that will let us make async requests. requests is just for example purposes, since we will demonstrate a slower version (not needed), and time is for measuring purposes (not needed). BTW, if you're putting in all of these imports, make sure grequests comes before requests, like above. grequests patches parts of the standard library (through gevent) when it is imported, and if requests gets in first you will have a really nasty error that might take you 45 minutes to figure out.

Scraping

links = ['https://dev.to/davidisrawi/build-a-quick-summarizer-with-python-and-nltk',
    'https://dev.to/kbk0125/web-servers-explained-by-running-a-microbrewery-48ie',
    'https://dev.to/dan_abramov/react-beginner-question-thread--1i5e',
    'https://dev.to/ben/im-ben-and-i-am-a-rails-developer-1j67',
    'https://dev.to/devteam/changelog-mentor-matchmaking-3bl0'
]

So I chose 5 articles that are pretty cool. These are the URLs we will be extracting the HTML from by making a request to each one. Let's build a bone-dry scraper with no async requests.

We don't want to make too many requests to dev.to's servers; let's be respectful :)

So, we will need to loop through the links list and make a simple request.

for link in links:
    req = requests.get(link)

After we make our request, we want to create a BeautifulSoup object. This will allow us to call really useful functions like .find() to easily extract what we want.

Calling .text on the response will dump all the HTML for that page, try it out!

soup = BeautifulSoup(req.text, 'lxml')

With our soup object we can now call .find() to retrieve the title of the current page. In our .find() you can see we are specifying an h1 tag with a class of medium, and calling .string on the result. What does all this mean?

# article
print(soup.find('h1', class_='medium').string.lstrip().rstrip())

To pinpoint a specific element (in this case a title), just inspect the page in Chrome dev tools, use the option that lets you select an HTML element by clicking on it, and choose the element. Dev tools will then highlight where in the HTML the element is located.

This will allow us to grab the attributes of that element, to specify for our .find(). The .string lets us get everything between the tags, like this:

<h1>Calling.stringGrabsThis</h1>

lstrip() and rstrip() remove leading and trailing whitespace.
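Here is a small sketch of what .string hands back, run on a made-up snippet rather than a live page (the HTML and the "other" class are just for illustration). It uses Python's built-in html.parser so it runs without lxml installed. Note that .strip() does the same job as calling lstrip() and rstrip() back to back.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real article page
html = """
<h1 class="medium">
    Async waffles for everyone
</h1>
<h1 class="other">Plain <em>and</em> nested</h1>
"""

soup = BeautifulSoup(html, "html.parser")

title_tag = soup.find("h1", class_="medium")
raw_title = title_tag.string  # still has the surrounding whitespace
title = raw_title.strip()     # same result as .lstrip().rstrip()

# .string is None when a tag has more than one child,
# so get_text() is the safer choice for nested markup
nested = soup.find("h1", class_="other")
nested_string = nested.string
nested_text = nested.get_text()
```

The second h1 is there to show a gotcha: if the element you pick wraps other tags, .string gives you None and the chained strip calls would raise an AttributeError.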

The process is the same with name, date, and tags.

for link in links:
    req = requests.get(link)
    soup = BeautifulSoup(req.text, 'lxml')

    # article
    print(soup.find('h1', class_='medium').string.lstrip().rstrip())

    # name
    print(soup.find('span', itemprop="name").string)

    # date
    print(soup.find('span', class_="published-at").string)

    # tags
    tags = list(map(lambda x: x.string, soup.find_all('a', class_='tag')))
    print(tags, "\n")

We did do something a little fancier to retrieve the tags, though. Since there are multiple tags, we call .find_all(), because each article tag is in its own a element. Using map, we apply a lambda function to pull the string out of each tag, just like above. Then, we just throw them in a list.
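If the map and lambda combo feels dense, a list comprehension does the same thing. This is a sketch on made-up markup (the real page's structure may differ), again using the built-in html.parser so it runs offline.

```python
from bs4 import BeautifulSoup

# Made-up markup; only the links with class "tag" should be picked up
html = """
<div class="tags">
  <a class="tag" href="/t/python">#python</a>
  <a class="tag" href="/t/nlp">#nlp</a>
  <a class="other" href="/about">About</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns every match; find() would stop at the first one
tags = [a.string for a in soup.find_all("a", class_="tag")]
print(tags)
```

Both versions build the same list, so which one you use is mostly a matter of taste.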

This is what we get:

Build a quick Summarizer with Python and NLTK
David Israwi
Aug 17 '17
['#python', '#nlp', '#dataanalytics', '#learning'] 

Web Servers Explained by Running a Microbrewery
Kevin Kononenko
May 17
['#tutorial', '#beginners', '#webdev'] 

React Beginner Question Thread ⚛
Dan Abramov
Dec 24 '17
['#react', '#javascript', '#beginners', '#webdev'] 

I’m Ben and I am a Rails developer
Ben Halpern
Apr 17
['#ruby', '#rails', '#productivity', '#career'] 

Changelog: Mentor Matchmaking!
Jess Lee
Jul  6
['#changelog', '#meta', '#mentorship'] 

Took 1.877747106552124 seconds
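One way to produce a "Took ... seconds" line like that is to record the time before and after the work. The sketch below is an assumption about how you might do it, not a copy of the exact measuring code: timed is a hypothetical helper, pretend_scrape is a stand-in for the scraping loop so it runs offline, and time.perf_counter() is used because it is meant for measuring short durations.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def pretend_scrape(n):
    # Stand-in for the scraping loop so the sketch runs offline
    time.sleep(0.05)
    return [f"article {i}" for i in range(n)]


articles, seconds = timed(pretend_scrape, 3)
print(f"Took {seconds} seconds")
```

Timings bounce around from run to run, so it is worth repeating the measurement a few times before comparing two versions.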

Now that we understand how scraping works the traditional way, we can adapt it to async requests. Instead of iterating through the links list and making the requests inside our loop, we will build a list of responses up front. Let's see how this works.

Using the grequests library we can create a list of unsent requests.

reqs = [grequests.get(link) for link in links]

We can then use map to fire off these requests and store the results in a list of responses. We use map because it allows us to start the next request without waiting for the current one to finish.

resp = grequests.map(reqs)

Now, instead of looping through the links, we've already got our list of responses, so we will iterate through that instead.

for r in resp:
    # we've already made our request, now just parse!
    soup = BeautifulSoup(r.text, 'lxml')

So our final async implementation should look like this. All the scraping stays the same!

reqs = [grequests.get(link) for link in links]
resp = grequests.map(reqs)

for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')

    # article
    print(soup.find('h1', class_='medium').string.lstrip().rstrip())

    # name
    print(soup.find('span', itemprop="name").string)

    # date
    print(soup.find('span', class_="published-at").string)

    # tags
    tags = list(map(lambda x: x.string, soup.find_all('a', class_='tag')))
    print(tags, "\n")

Final output:

Build a quick Summarizer with Python and NLTK
David Israwi
Aug 17 '17
['#python', '#nlp', '#dataanalytics', '#learning'] 

Web Servers Explained by Running a Microbrewery
Kevin Kononenko
May 17
['#tutorial', '#beginners', '#webdev'] 

React Beginner Question Thread ⚛
Dan Abramov
Dec 24 '17
['#react', '#javascript', '#beginners', '#webdev'] 

I’m Ben and I am a Rails developer
Ben Halpern
Apr 17
['#ruby', '#rails', '#productivity', '#career'] 

Changelog: Mentor Matchmaking!
Jess Lee
Jul  6
['#changelog', '#meta', '#mentorship'] 

Took 0.8569440841674805 seconds

This is a huge time saver. The first version averaged about 1.5 to 1.8 seconds and the second averaged about 0.8 to 1.0 seconds, so we can safely say we are saving roughly 0.1 to 0.2 seconds per request. This may not seem huge at first, but that's just 5 requests! When making many async calls, you will be saving a ton of time. Test this out on your own, since results will vary by application. You could fire a ton of requests, 20 at a time or 5 at a time; experiment to find out what best suits your application's needs!
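One thing to watch when you scale this up: grequests.map() puts None in the list for any request that failed, so calling r.text on it would blow up. Below is a small sketch of a filter for that. usable is a hypothetical helper (not part of grequests), and SimpleNamespace objects stand in for real responses so the example runs without touching the network.

```python
from types import SimpleNamespace


def usable(responses):
    """Keep only responses that arrived and came back with HTTP 200."""
    return [r for r in responses if r is not None and r.status_code == 200]


# Stand-ins for what grequests.map() might hand back
fake_resp = [
    SimpleNamespace(status_code=200, text="<h1>ok</h1>"),
    None,  # a request that failed outright
    SimpleNamespace(status_code=404, text="not found"),
]

good = usable(fake_resp)
```

With the real thing, that would look like usable(grequests.map(reqs, size=5)), where size caps how many requests are in flight at once. That is the knob to turn when experimenting with 5 versus 20 at a time.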

If you want to save even more time, look into SoupStrainer for BeautifulSoup, an object you pass to the BeautifulSoup constructor (as parse_only) so that only the parts of the page you care about get parsed. It may not yield gains like above, but it's a great tool to help keep your code efficient! Maybe that's an article for another day.

I hope you enjoyed this little guide to scraping! Feel free to drop a comment or reach out to me for any problems or questions you may have. I hope this helps!

peace out