Simple Web Scraping Project Using Python and Beautiful Soup

#datascience #beginners #python #tutorial

Introduction

Web scraping is an automated method for extracting large amounts of data from websites. Most of the data on websites is unstructured. Web scraping collects this data and stores it in a structured form.

In this project I will show you how to scrape data from Jumia (https://www.jumia.co.ke/), a Kenyan online shopping site. The data we gather can be used for price comparison.

Website Inspection

The aim of this project is to scrape every product along with its price and rating. First, we need to inspect the website:

1. Visit https://www.jumia.co.ke/all-products/

2. Right-click the page and select Inspect, or press Ctrl+Shift+I, to open the browser's developer tools.

3. Move the cursor around until a product is highlighted, then look for the div tag that contains the name, price and rating of the product.

Write the code
We start by importing the necessary libraries

from bs4 import BeautifulSoup
import requests

The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us.

jumia = requests.get('https://www.jumia.co.ke/all-products/')
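
Before parsing, it helps to confirm that the request actually succeeded. The following is a minimal sketch rather than part of the original script: fetch_html is a hypothetical helper name, the 10-second timeout is an arbitrary choice, and the get parameter is there only so the function can be tried without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page and return its raw HTML as bytes.

    fetch_html and its `get` parameter are illustrative names, not part
    of the requests library; `get` defaults to requests.get.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses,
    # so a failed download is not parsed as if it were a real page.
    response.raise_for_status()
    return response.content
```

Calling fetch_html('https://www.jumia.co.ke/all-products/') should then return the same bytes as jumia.content.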

Parsing a page using BeautifulSoup

soup = BeautifulSoup(jumia.content, 'html.parser')
products = soup.find_all('div', class_='info')

The find_all method returns every instance of the div tag on the page that has the class 'info'.
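
To see how find_all and find behave without hitting the live site, here is a small sketch that runs on an inline HTML string. The markup and product names are hypothetical and heavily simplified; only the class names 'info', 'name', 'prc' and 'stars _s' come from the page inspected earlier.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup; the real page is much larger.
SAMPLE_HTML = """
<div class="info"><h3 class="name">Phone A</h3><div class="prc">KSh 100</div></div>
<div class="info"><h3 class="name">Phone B</h3><div class="prc">KSh 200</div></div>
<div class="other">Not a product</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list-like ResultSet containing every matching tag.
products = soup.find_all("div", class_="info")
names = [p.find("h3", class_="name").text for p in products]

# find returns only the first match, or None when nothing matches.
first = soup.find("div", class_="info")
missing = soup.find("div", class_="stars _s")

print(len(products), names, missing)
```

The None returned for the missing rating element is why calling .text on it directly would fail.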

We now extract the name, price and rating. To find only the first instance of a tag, use the find method, which returns a single Tag object (or None if nothing matches):

product = products[0]  # the first product on the page
Name = product.find('h3', class_='name').text.replace('\n', '')
Price = product.find('div', class_='prc').text.replace('\n', '')
Rating = product.find('div', class_='stars _s').text.replace('\n', '')

replace() is a built-in string method that returns a copy of the string in which all occurrences of a substring are replaced with another substring. Here it removes the newline characters from the extracted text.
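
As a quick illustration of what replace() does and does not clean up, the sketch below compares it with strip() on a made-up string. The sample text is an assumption, not real output from the site.

```python
# Hypothetical sample text with a leading newline and indentation.
raw = "\n  Samsung Galaxy A14\n"

# replace() removes every newline but leaves other whitespace in place.
without_newlines = raw.replace("\n", "")

# strip() removes all leading and trailing whitespace, including newlines.
stripped = raw.strip()

print(repr(without_newlines))
print(repr(stripped))
```

Both calls return new strings and leave raw unchanged, so which one to use depends on whether surrounding spaces matter for your data.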

We can now loop over all products on the page to extract the name, price and rating.

for product in products:
    Name = product.find('h3', class_='name').text.replace('\n', '')
    Price = product.find('div', class_='prc').text.replace('\n', '')
    Rating = product.find('div', class_='stars _s').text.replace('\n', '')

    info = [Name, Price, Rating]
    print(info)

Note that the three values for each product are stored in a list called info, which is overwritten on every pass through the loop.
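
If you want to keep the results rather than only print them, one option is to append each info list to an outer list. This is a sketch with stand-in values: the scraped tuples and the name rows are assumptions for illustration, and in the real loop the values would come from product.find(...).

```python
# Hypothetical stand-in values; in the scraper these come from product.find(...).
scraped = [
    ("Phone A", "KSh 100", "4 out of 5"),
    ("Phone B", "KSh 200", "None"),
]

rows = []  # one entry per product, still available after the loop ends
for Name, Price, Rating in scraped:
    info = [Name, Price, Rating]
    rows.append(info)

print(len(rows))
```

After the loop, info only holds the last product, while rows holds all of them.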

Loop over all pages
So far we have only scraped data from the first page. The site has 50 pages, and when you click through to the second page you will notice that the URL changes. To build the URL for a given page number, we do this:

url = "https://www.jumia.co.ke/all-products/" + "?page=" +str(page)+"#catalog-listing"

That is simple string concatenation. The code to loop through all the pages is:

for page in range(1, 51):
    url = "https://www.jumia.co.ke/all-products/" + "?page=" + str(page) + "#catalog-listing"
    furl = requests.get(url)
    jsoup = BeautifulSoup(furl.content, 'html.parser')
    products = jsoup.find_all('div', class_='info')

    for product in products:
        Name = product.find('h3', class_='name').text.replace('\n', '')
        Price = product.find('div', class_='prc').text.replace('\n', '')
        try:
            Rating = product.find('div', class_='stars _s').text.replace('\n', '')
        except AttributeError:
            # find() returned None: this product has no rating
            Rating = 'None'

        info = [Name, Price, Rating]
        print(info)

The range() function stops just before its second argument, so range(1, 51) yields the page numbers 1 through 50, one for each of the site's 50 pages.
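
To check the page URLs before making any requests, they can be generated on their own. This is a small sketch: BASE_URL and page_url are hypothetical names, and it uses an f-string in place of the + concatenation, which produces the same text.

```python
BASE_URL = "https://www.jumia.co.ke/all-products/"


def page_url(page):
    """Return the listing URL for a 1-based page number."""
    return f"{BASE_URL}?page={page}#catalog-listing"


# range(1, 51) covers pages 1 to 50 inclusive.
urls = [page_url(page) for page in range(1, 51)]

print(len(urls))
print(urls[0])
print(urls[-1])
```

The #catalog-listing part is a URL fragment. Browsers use it to jump to a section of the page and it is not sent to the server, so it should not affect what gets downloaded.
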
Some products have no rating, so find returns None for them and accessing .text raises an error. We therefore wrap that lookup in a try/except block and store the string 'None' as the rating in that case.
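
An alternative to try/except is to check for None explicitly. The sketch below is one way to do that, run on hypothetical simplified markup; extract_rating is an illustrative helper name, not something from the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical simplified markup: the second product has no rating element.
SAMPLE_HTML = """
<div class="info"><h3 class="name">Phone A</h3><div class="stars _s">4 out of 5</div></div>
<div class="info"><h3 class="name">Phone B</h3></div>
"""


def extract_rating(product):
    """Return the rating text, or the string 'None' when it is absent."""
    tag = product.find("div", class_="stars _s")
    if tag is None:
        return "None"
    return tag.text.replace("\n", "")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
ratings = [extract_rating(p) for p in soup.find_all("div", class_="info")]
print(ratings)
```

The explicit check only handles the missing-tag case, so any other unexpected error would still surface instead of being hidden.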

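Saving to CSV

To save the results, collect every info list in one list (called rows here) inside the loop, then write them out with pandas:

import pandas as pd

# rows holds one [Name, Price, Rating] list per product, appended inside the loop
df = pd.DataFrame(rows, columns=['Product Name', 'Price', 'Rating'])
df.to_csv('products.csv', index=False, encoding='utf-8')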
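
If pandas is not installed, the standard library's csv module can produce a similar file. This is a swapped-in alternative to the pandas call, shown as a sketch: write_products is a hypothetical helper, the rows are stand-in values, and the output goes to an in-memory buffer so the example runs without creating a file.

```python
import csv
import io

# Hypothetical stand-in rows; in the scraper these are the collected info lists.
rows = [
    ["Phone A", "KSh 100", "4 out of 5"],
    ["Phone B", "KSh 200", "None"],
]


def write_products(rows, file):
    """Write a header line followed by one CSV line per product."""
    writer = csv.writer(file)
    writer.writerow(["Product Name", "Price", "Rating"])
    writer.writerows(rows)


# Written to an in-memory buffer here; for a real file, use
# open("products.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_products(rows, buffer)
print(buffer.getvalue())
```

The csv module quotes fields that contain commas, so values with commas in them stay in a single column.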

The whole code

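from bs4 import BeautifulSoup
import pandas as pd
import requests

rows = []  # one [Name, Price, Rating] list per product

for page in range(1, 51):
    url = "https://www.jumia.co.ke/all-products/" + "?page=" + str(page) + "#catalog-listing"
    furl = requests.get(url)
    jsoup = BeautifulSoup(furl.content, 'html.parser')
    products = jsoup.find_all('div', class_='info')

    for product in products:
        Name = product.find('h3', class_='name').text.replace('\n', '')
        Price = product.find('div', class_='prc').text.replace('\n', '')
        try:
            Rating = product.find('div', class_='stars _s').text.replace('\n', '')
        except AttributeError:
            # find() returned None: this product has no rating
            Rating = 'None'

        info = [Name, Price, Rating]
        print(info)
        rows.append(info)

# Save everything that was collected to a CSV file
df = pd.DataFrame(rows, columns=['Product Name', 'Price', 'Rating'])
df.to_csv('products.csv', index=False, encoding='utf-8')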