
Peter Wainaina


Web Scraping: Unleashing Insights from Online Data.

Photo by [Nathan Dumlao](https://unsplash.com/@nate_dumlao?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&utm_medium=referral)

Some background info:

Data collection is a step in the data science process: it involves gathering relevant data from various sources to be used for analysis or for building models. One method of data collection is web scraping, which you will be able to do by the end of this tutorial.

What is Web Scraping?

Web scraping is a technique for collecting data from the internet. It is an automated process of extracting large amounts of unstructured data from websites.

When do you need Web Scraping?

Under normal circumstances, you would ideally use an API to fetch content from a target website, but not every website exposes an API through which you can fetch data. In such a scenario, you may have to turn to web scraping to get the content you want.

Some of the reasons for scraping web pages include:

  1. Collecting data to be used for analysis or building Machine Learning Models

  2. Customer sentiment analysis through product reviews, for example on e-commerce websites.

  3. Competitor analysis and product price comparison.

Take note that some websites explicitly forbid the use of automated web scraping tools, so always check your target website’s acceptable use policy to ensure you do not violate its terms of use.

How to web-scrape.

You are going to use Beautiful Soup (installed as beautifulsoup4), a Python library dedicated to web scraping.

A web scraper uses the Hypertext Transfer Protocol (HTTP) to request data from a target website, typically with the GET method.

You will GET the HTML from your target URL using the requests library in Python, pass the returned content into Beautiful Soup, and then use selectors to query the specific data you want from the page.
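In outline, that flow is just a fetch, a parse and a query. Here is a minimal sketch, assuming a placeholder URL and an illustrative class name:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page over HTTP with a GET request (placeholder URL)
response = requests.get("https://example.com/products")

# Parse the returned HTML so it can be queried
soup = BeautifulSoup(response.content, "html.parser")

# Query specific elements with a CSS selector; the class name is illustrative
names = soup.select(".product-name")
```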

Prerequisites.

  1. Have python3 installed.

  2. An IDE to run your code, for example VS Code.

  3. After making sure that you have python3 installed on your computer, you will need to install the necessary libraries for building the web scraper. These libraries are: pandas, beautifulsoup4 and requests.

  4. Have a copy of the URL to your target website. In my case, I will scrape data from backmarket.com.

Install the required libraries:

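Assuming you manage packages with pip, the install step looks like this (note that Beautiful Soup is published on PyPI as beautifulsoup4):

```bash
pip install pandas beautifulsoup4 requests
```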

Import the required libraries into your code:

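The three imports look like this (Beautiful Soup is imported from the bs4 module):

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```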

Initialize an empty list to store the data you will get from the target website:

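One line suffices; the name data matches how the list is referenced later in the walkthrough:

```python
# Will hold one [name, price, status] entry per scraped product
data = []
```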

Inspect the target Website:

Go to your target website and inspect it, either by right-clicking and choosing Inspect or by using the shortcut ctrl+shift+i on Windows or command+option+i on Mac.


Navigate to the specific HTML of a product card of your choice, preferably the first one on the page for easier navigation. As you collapse the HTML, you will see the corresponding product overlaid with a blue highlight. Keep collapsing until you reach the target data; in my case I want the name of the phone, the price and the status.


Here is the collapsed HTML of my product card. I will select the classes inside the div that hold the name, storage and status, which will query the specific data as you will see in the full code below.

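To make the selection idea concrete, here is a self-contained sketch with a made-up product card. The class names are purely illustrative; substitute the ones you find in your own inspected HTML:

```python
from bs4 import BeautifulSoup

# A made-up stand-in for a collapsed product card; the real class names
# on the target site will differ, so use the ones you find when inspecting
html = """
<div class="product-card">
  <h2 class="product-name">Samsung Galaxy S21 128GB</h2>
  <span class="product-price">$329.00</span>
  <span class="product-status">Excellent</span>
</div>
"""

card = BeautifulSoup(html, "html.parser")
name = card.find("h2", class_="product-name").get_text(strip=True)
price = card.find("span", class_="product-price").get_text(strip=True)
status = card.find("span", class_="product-status").get_text(strip=True)
print(name, price, status)  # Samsung Galaxy S21 128GB $329.00 Excellent
```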

Here is the complete code, which sends a GET request to the target URL, queries the specific data and stores it in a CSV file.

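A minimal sketch of that script follows. The URL pattern and every class name in it are placeholder assumptions (the real selectors depend on the site’s markup, so substitute the ones you found while inspecting); the overall structure matches the walkthrough below.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Will hold one [name, price, status] entry per scraped product
data = []

# The listing spans 13 pages; the page number is passed as a query parameter.
# The URL pattern and class names below are placeholders.
for page in range(1, 14):
    url = f"https://www.backmarket.com/search?q=samsung&page={page}"
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.content, "html.parser")

    # Each product card holds the name, price and status of one phone
    for card in soup.find_all("div", class_="product-card"):
        name = card.find("h2", class_="product-name").get_text(strip=True)
        price = card.find("span", class_="product-price").get_text(strip=True)
        status = card.find("span", class_="product-status").get_text(strip=True)

        info = [name, price, status]
        data.append(info)

# Put the rows into a DataFrame and write them out as a CSV file
df = pd.DataFrame(data, columns=["Name", "Price", "Status"])
df.to_csv("samsung_phones.csv", index=False)
```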

What specifically is the code doing?

As you saw earlier, the first three lines of code are the imports the web scraper needs.

You then define an empty list where the data from the target website will be stored.

There is a for loop that runs over all 13 pages of my target website. The page numbers were indicated at the bottom of the page in my case, so I simply specified them as they were. This may not always be the case: sometimes you may be scraping hundreds of pages whose count is not explicitly stated.

For each product on my target website (Samsung phones, in my case), I extract the name, price and status of the phone.

I then create a list called info containing three elements (Name, Price and Status) and append it to the “data” list defined right after the imports.

Finally, using pandas, I put the data into a DataFrame and write the DataFrame out as a CSV file, which is saved in the same directory as the file containing your code.

Conclusion.

There you have it! You can now scrape a target website for the specific data you need, perhaps for data analysis. In my case I managed to scrape 892 products, did Exploratory Data Analysis and zeroed in on the specific phone I wanted. This saved me the time I would otherwise have spent scrolling through all the pages and looking over all 892 Samsung phones listed.

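Here is a minimal sketch of that kind of exploration, assuming the CSV produced above; the price-cleaning regex is illustrative and depends on how the site formats prices:

```python
import pandas as pd

# Load the scraped data back in
df = pd.read_csv("samsung_phones.csv")

# Quick look at the size and the first few rows
print(df.shape)   # e.g. (892, 3)
print(df.head())

# Strip currency symbols so prices can be compared numerically
# (illustrative; adjust to the actual price format on the site)
df["Price"] = df["Price"].str.replace(r"[^\d.]", "", regex=True).astype(float)

# Example exploration: condition breakdown and the cheapest offers
print(df["Status"].value_counts())
print(df.sort_values("Price").head(10))
```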
