Lewis kori

Posted on Aug 23, 2019 • Edited on Sep 17, 2020 • Originally published at lewiskori.com

Introduction to web scraping with python

#python #beginners #tutorial

Originally published on my website

What is web scraping?

This is the process of extracting information from a webpage by taking advantage of patterns in the web page's underlying code.
We can use web scraping to gather unstructured data from the internet, process it, and store it in a structured format.
In this walkthrough, we'll be storing our data in a JSON file.

Alternatives to web scraping

Though web scraping is a useful tool in extracting data from a website,
it's not the only means to achieve this task.
Before starting to web scrape, find out if the page you seek to extract data from provides an API.

robots.txt file

Ensure that you check the robots.txt file of a website before making your scrapper. This file tells if the website allows scraping or if they do not.
To check for the file, simply type the base URL followed by "/robots.txt"
An example is, "mysite.com/robots.txt".
For more about robots.txt files click here.

Beginner's guide to web scraping with python's selenium

Lewis kori ・ Sep 4 '19

#tutorial #beginners #python

Using this scraperapi link and the code
lewis10, you'll get a 10% discount off your first purchase!

Getting started

In this tutorial, we'll be extracting data from books to scrape
which you can use to practice your web scraping.
We'll extract the title, rating, link to more information about the book, and the cover image of the book.

Find the code on github.

1. Importing libraries

import requests
import json
from bs4 import BeautifulSoup

The python libraries perform the following tasks.

requests - will be used to make Http requests to the webpage.
json - we'll use this to store the extracted information to a JSON file.
BeautifulSoup - for parsing HTML.

2. walkthrough

header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}

base_url = "http://books.toscrape.com/"

r = requests.get(base_url,headers=header)

We're initializing three variables here.

header-HTTP headers provide additional parameters to HTTP transactions. By sending the appropriate HTTP headers, one can access the response data in a different format.
base_url - is the webpage we want to scrape since we'll be needing the URL quite often, it's good to have a single initialization and reuse this variable going forward.
r - this is the response object returned by the get method. Here, we pass the base_url and header as parameters.

if r.status_code == 200:
  soup = BeautifulSoup(r.text, 'html.parser')
  books = soup.find_all('li',attrs={"class":"col-xs-6 col-sm-4 col-md-3 col-lg-3"})
  result=[]
  for book in books:
    title=book.find('h3').text
    link=base_url +book.find('a')['href']
    stars = str(len(book.find_all('i',attrs=  {"class":"icon-star"}))) + " out of 5"
    price="$"+book.find('p',attrs={'class':'price_color'}).text[2:]
    picture = base_url + book.find('img')['src']
    single ={'title':title,'stars':stars,'price':price,'link':link,'picture':picture}
    result.append(single)
    with open('books.json','w') as f:
      json.dump(result,f,indent=4)
else:
  print(r.status_code)

To ensure our scraper runs when the HTTP response is ok we'll use the if statement as a check. The number 200 is the status code for Ok. To get a list of all codes and their meanings check out this resource.
We'll then parse the response object using the BeautifulSoup method and store the new object to a variable called soup.

From the aforementioned definition,

Web scraping the process of extracting information from a webpage by taking advantage of patterns in the web page's underlying code.

Let's take a look at a single record from our webpage to identify the patterns. Once we can see the page, we'll loop through every record on the page as they contain similar traits.

From the image above, we'll notice that all books are contained within a list item with the class

col-xs-6 col-sm-4 col-md-3 col-lg-3

By using the find_all() method, we can find all references of this HTML tag in the webpage. we pass the tag as the first argument and then using the attrs argument which takes in a python dictionary, we can specify attributes of the HTML tag selected. In this case, it was a class indicated above, but you can even use id as an attribute.

Store the result in a variable, I chose the name books.

title=book.find('h3').text
link=base_url +book.find('a')['href']

If we observe keenly, we'll notice that each of the elements we want to extract is nested within the list item tag are all contained in similar tags, in the example above. The title of the book is between h3 tags.
The find() method returns the first matching tag.
text will simply return any text found within the tags specified.
For the anchor tags, we'll be extracting the hyper reference link.
As opposed to the h3 tag, the href element is within anchor tags in HTML. Like so

<a href="somelink.com"></a>

In this case, the returned object will behave like a dictionary where we have a

dictionary_name[key]

We do this iteratively for all the values we seek to extract because we are taking advantage of the pattern in the underlying code of the webpage. Hence the use of the python for loop.

single ={'title':title,'stars':stars,'price':price,'link':link,'picture':picture}
result.append(single)
with open('books.json','w') as f:
    json.dump(result,f,indent=4)

The extracted elements are then stored in respective variables which we'll put in a dictionary. With this information, we can then comfortably append the dictionary object to the initialized result list set before our for loop.

Finally, store the python list in a JSON file by the name "books.json" with an indent of 4 for readability purposes.

With that, you have your simple web scraper up and running. For more on web scrapers, read the documentation for the libraries or on youtube.

If you liked this walkthrough, subscribe to my mailing list to get notified whenever I make new posts.

Thanks.

Top comments (5)

Comment deleted

Lewis kori • Nov 26 '19 • Edited

Thanks for the message.
Your article was really well written and informative. I can honestly say
I've learnt a thing or two I didn't know about scrapy.

This post was actually part of a series. You'll find the scrapy part here though the approach I did was trying to explain how to navigate CAPTCHAs in websites and IP address blocking.

You can check that out as well and share your thougts on any issue.

#ZedZed • Aug 24 '19

I really am liking BeautifulSoup having just recently started learning about it

Lewis kori • Aug 24 '19

It's interesting. Really simplifies web scraping even for beginners.
Continue playing around with it. I am too!
Always learning something new every day

Some comments may only be visible to logged-in visitors. Sign in to view all comments.

DEV Community