Hunter Johnson for Educative

Posted on • Edited on • Originally published at educative.io

Web scraping with Python: A quick guide

The internet is arguably the most abundant data source that you can access today. Crawling through this massive web of information on your own would take a superhuman amount of effort. So, why not build a web scraper to do the detective work for you? Automated web scraping is a great way to collect relevant data across many webpages in a relatively short amount of time.

You may be wondering why we chose Python for this tutorial, and the short answer is that Python is considered one of the best programming languages to use for web scraping. Python libraries like BeautifulSoup and packages like Selenium have made it incredibly easy to get started with your own web scraping project.

We'll introduce you to some basic principles and applications of web scraping. Then, we'll take a closer look at some of the more popular Python tools and libraries used for web scraping before moving on to a quick step-by-step tutorial for building your very own web scraper.
Let's get started!

We'll cover:

  • Overview: Web scraping with Python
  • Build a web scraper with Python
  • Wrapping up and next steps

Overview: Web scraping with Python

As a high-level, interpreted language, Python 3 is one of the easiest languages to read and write because its syntax bears some similarities to the English language. Luckily for us, Python is much easier to learn than English. Python is also a great choice in general for anyone who wants to dabble in data science, artificial intelligence, machine learning, web applications, image processing, or operating systems.

This section will cover what Python web scraping is, what it can be used for, how it works, and the tools you can use to scrape data.

What is web scraping?

Web scraping is the process of extracting usable data from webpages for analysis, comparison, and many other purposes. The data that can be collected includes text, images, ratings, URLs, and more. Web scrapers extract this data by requesting a URL and parsing the HTML code returned for that page. Advanced web scrapers are capable of extracting CSS and JavaScript code from the webpage as well.

Believe it or not, web scraping used to be conducted manually by copying and pasting data from webpages into text files and spreadsheets!

Legality

As long as the data you're scraping is publicly available, does not require an account for access, and isn't disallowed by the site's robots.txt file, it's generally considered fair game.
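
One of those checks, the robots.txt rule, can be automated. Here is a minimal sketch using Python's built-in urllib.robotparser module. The robots.txt content, the "my-scraper" user agent, and the example.com URLs are made up for illustration; a real scraper would download the file from the site it targets.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a real download.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A public catalog page versus a page under the disallowed /account/ path.
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/womens/jeans"))
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/account/orders"))
```

To check a live site, RobotFileParser also offers set_url() and read(), which fetch the file over the network before you call can_fetch().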

What's the difference between a web crawler and a web scraper?

A web crawler just collects data (usually to archive or index), while web scrapers look for specific types of data to collect, analyze, and transform.

What can web scraping be used for?

Web scraping has a wide variety of applications. Web developers, digital marketers, data scientists, and journalists regularly use web scraping to collect publicly available data. This data can be transferred to a spreadsheet or JSON file for easy data analysis, or it can be used to create an application programming interface (API). Web scraping is also great for building bots, automating complicated searches, and tracking the prices of goods and services.

Here are some other real-world applications of web scraping:

  • Natural language processing
  • Machine learning
  • Predictive analytics
  • Real-time analytics
  • Price monitoring
  • Financial data aggregation
  • Scraping song lyrics
  • Tracking stock prices
  • Fetching images and product descriptions
  • Consumer sentiment analysis
  • Search Engine Optimization (SEO) monitoring

How does web scraping work?

Web scraping involves three steps:

  • Data collection: In this step, data is collected from webpages (typically with a web crawler)
  • Data parsing and transformation: This next step involves transforming the collected dataset into a format that can be used for further analysis (like a spreadsheet or JSON file)
  • Data storage: The last stage of web scraping involves storing the transformed data in a JSON, XML, or CSV file
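
The three steps above can be sketched with nothing but the standard library. This is only an illustration: the PAGE string stands in for downloaded HTML, and the class name "price" and the PriceParser helper are assumptions made for the example, not markup from a real site.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Step 1 (collection): a hard-coded page stands in for downloaded HTML.
PAGE = (
    '<ul><li><span class="price">$98.00</span></li>'
    '<li><span class="price">$128.00</span></li></ul>'
)


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


# Step 2 (parsing and transformation): raw strings become floats.
parser = PriceParser()
parser.feed(PAGE)
prices = [float(text.lstrip("$")) for text in parser.prices]

# Step 3 (storage): serialize to JSON and CSV (in memory here).
as_json = json.dumps({"prices": prices})
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["price"])
writer.writerows([p] for p in prices)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

Writing to a real file works the same way: pass an open file object to csv.writer or json.dump instead of the in-memory buffer.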

What tools and libraries are used to scrape the web?

These are some of the most popular tools and libraries used to scrape the web using Python.

  • Beautiful Soup: A popular Python library used to extract data from HTML and XML files
  • MechanicalSoup: Another Python library used to automate interactions on websites (like submitting forms)
  • Scrapy: A high-speed, open-source web crawling and scraping framework
  • Selenium: A suite of open-source automation tools that provides an API to write acceptance or functional tests
  • Python Requests: The requests library lets you send HTTP/1.1 requests without manually adding query strings to URLs or form-encoding POST data
  • LXML: A tool used for processing XML and HTML in the Python language
  • Urllib: A package used for opening URLs
  • Pandas: Not typically used for scraping, but useful for data analysis, manipulation, and storage

However, for the purposes of this tutorial, we'll be focusing on just three: Beautiful Soup 4 (BS4), Selenium, and the statistics.py module.

Build a web scraper with Python

Let's say we want to compare the prices of women's jeans on Madewell and NET-A-PORTER to see who has the better price.

For this tutorial, we'll build a web scraper to help us compare the average prices of products offered by two similar online fashion retailers.

Step 1: Select the URLs you want to scrape

For both Madewell and NET-A-PORTER, you'll want to grab the target URL from their webpage for women's jeans.

For Madewell, this URL is:

https://www.madewell.com/womens/clothing/jeans 

For NET-A-PORTER, your URL will be:

https://www.net-a-porter.com/en-us/shop/clothing/jeans

Step 2: Find the HTML content that you want to scrape

Once you've selected your URLs, you'll want to figure out what HTML tags or attributes your desired data will be located under. For this step, you'll want to inspect the source of your webpage (or open the Developer Tools Panel).

You can do this with a right-click on the page you're on, and selecting Inspect from the drop-down menu.

Google Chrome shortcut: Ctrl + Shift + C on Windows or Command + Shift + C on macOS opens the inspector so you can view the page's HTML code for this step.

In this case, we're looking for the price of jeans. If you look through the HTML document, you'll notice that this information is available under the <span> tag for both Madewell and NET-A-PORTER.

However, using the <span> tag would retrieve too much irrelevant data because it's too generic. We want to narrow down our target when data scraping, and we can get more specific by using attributes inside of the <span> tag instead.

For Madewell, a more specific target is the value of the class attribute:

product-sales-price product-usd

For NET-A-PORTER, we'd want to narrow down our target with the itemprop attribute, which is set to "price" on the price tags:

itemprop="price"
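
To see why the attribute matters, here is a small sketch that runs Beautiful Soup against a hand-written snippet. The HTML below is a made-up stand-in that only borrows the class name and itemprop value mentioned above; the real pages are far larger and may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not copied from either retailer.
html = """
<div>
  <span class="product-name">Slim jeans</span>
  <span class="product-sales-price product-usd">$98.00</span>
  <span itemprop="price">$250</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching by tag alone returns every <span>, relevant or not.
all_spans = soup.find_all("span")

# Searching by the full class string or by itemprop returns only the prices.
by_class = soup.find_all("span", class_="product-sales-price product-usd")
by_itemprop = soup.find_all("span", {"itemprop": "price"})

print(len(all_spans))                      # all three spans
print([tag.text for tag in by_class])      # only the class-matched price
print([tag.text for tag in by_itemprop])   # only the itemprop-matched price
```

Passing the whole class string matches tags whose class attribute is exactly that string, so the order of the two class names matters in this form.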

Step 3: Choose your tools and libraries

For this task, we will be using the Selenium and Beautiful Soup 4 (BS4) libraries in addition to the statistics.py module. Here's a quick breakdown of why we chose these web scraping tools:

Selenium

Selenium can automatically open a web browser and run tasks in it using a simple script. Selenium needs a driver for the browser it controls, so we decided to use Google Chrome and downloaded its driver from here: ChromeDriver Downloads

BS4

We're using BS4 with Python's built-in HTML parser because it's simple and beginner-friendly. A BS4 object gives us access to tools that can scrape any given website through its tags and attributes.

Scrapy is another Python library that would have been suitable for this task, but it's a little more complex than BS4.

statistics.py

The statistics module (statistics.py in Python's standard library) provides functions for calculating mathematical statistics of numeric data.

Method               Description
statistics.mean()    Calculates the mean (average) of the given data

Step 4: Build your web scraper in Python

4a: Import your libraries

First, you'll want to import statistics, webdriver from selenium, and the BeautifulSoup class from bs4.

from bs4 import BeautifulSoup
from selenium import webdriver
import statistics

4b: Create a Google Chrome driver object

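from selenium.webdriver.chrome.service import Service

PATH = r'/usercode/chromedriver'  # location of the ChromeDriver executable
driver = webdriver.Chrome(service=Service(PATH))  # Selenium 4 takes the path via Service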

4c: Fetch NET-A-PORTER's website data from their URL using the driver.get method

driver.get("https://www.net-a-porter.com/en-us/shop/clothing/jeans")

4d: Create a beautifulsoup object

Here, we create a BeautifulSoup object, passing the HTML source (driver.page_source) and the name of Python's built-in HTML parser (html.parser) as arguments.
The find_all call then searches the parsed page for the specific tags and attributes we identified in Step 2.

soup = BeautifulSoup(driver.page_source, 'html.parser')
response = soup.find_all("span", {"itemprop" : "price"})

4e: Save the price data into a list, then print it

data = []
for item in response:
    # Strip newlines and the dollar sign, then convert the text to a number
    data.append(float(item.text.strip("\n$")))

print(data)
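
Note that float() raises a ValueError for strings such as "1,250", so the cleaning step may need to handle thousands separators too. Below is a hedged sketch of a slightly more defensive cleaner. The parse_price helper is hypothetical, and it assumes US-style formatting (comma as thousands separator, dot as decimal point); whether either retailer actually formats prices this way is not something this tutorial confirms.

```python
import re


def parse_price(text: str) -> float:
    """Convert a scraped price string such as '$1,250.00' to a float."""
    # Keep digits and the decimal point only; drops '$', commas, and whitespace.
    cleaned = re.sub(r"[^\d.]", "", text)
    if not cleaned:
        raise ValueError(f"no numeric price found in {text!r}")
    return float(cleaned)


print(parse_price("\n$98.00\n"))
print(parse_price("$1,250"))
```

With a helper like this, the loop body could become data.append(parse_price(item.text)).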

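4f: Print the mean of your data from each website

Once you have a list of prices for each site (extracted_data1 for NET-A-PORTER and extracted_data2 for Madewell in the completed code), print the mean of each list.

print(statistics.mean(extracted_data1))
print(statistics.mean(extracted_data2))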
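
Since the goal is to see who has the better price, the two means can also be compared directly. The sketch below uses a hypothetical compare_average_prices helper and made-up numbers. It also guards against empty lists, because statistics.mean raises StatisticsError when given no data, which is what happens if the scraper matched no elements.

```python
import statistics


def compare_average_prices(prices_a, prices_b, name_a="shop 1", name_b="shop 2"):
    """Return a one-line summary of which list has the lower mean price."""
    if not prices_a or not prices_b:
        # statistics.mean would raise StatisticsError on an empty list.
        return "Not enough data to compare."
    mean_a = statistics.mean(prices_a)
    mean_b = statistics.mean(prices_b)
    if mean_a == mean_b:
        return f"{name_a} and {name_b} have the same average price: {mean_a:.2f}"
    if mean_a < mean_b:
        cheaper, low, high = name_a, mean_a, mean_b
    else:
        cheaper, low, high = name_b, mean_b, mean_a
    return f"{cheaper} is cheaper on average: {low:.2f} vs {high:.2f}"


# Made-up numbers for illustration; real values come from the scraped lists.
summary = compare_average_prices([250.0, 310.0], [98.0, 128.0], "NET-A-PORTER", "Madewell")
print(summary)
```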

Completed code

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import statistics

# Location of the ChromeDriver executable (shared by both functions)
PATH = r'/usercode/chromedriver'


def shop1():
    # NET-A-PORTER: prices sit in <span itemprop="price"> tags
    driver = webdriver.Chrome(service=Service(PATH))
    driver.get("https://www.net-a-porter.com/en-us/shop/clothing/jeans")

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()  # the page source is parsed, so the browser can close
    response = soup.find_all("span", {"itemprop": "price"})
    data = []

    for item in response:
        data.append(float(item.text.strip('$')))

    print(data)
    return data


def shop2():
    # Madewell: prices sit in <span class="product-sales-price product-usd"> tags
    driver = webdriver.Chrome(service=Service(PATH))
    driver.get("https://www.madewell.com/womens/clothing/jeans")

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()  # the page source is parsed, so the browser can close
    response = soup.find_all("span", "product-sales-price product-usd")

    data = []

    for item in response:
        data.append(float(item.text.strip("\n$")))

    print(data)
    return data


extracted_data1 = shop1()
extracted_data2 = shop2()

# Compare the average price at each retailer
print(statistics.mean(extracted_data1))
print(statistics.mean(extracted_data2))

Step 5: Repeat for Madewell

Using the above code, you can repeat the steps for Madewell. As a quick reminder, here are the basic steps you'll need to follow:

  1. Assign the webdriver file path to a path variable
  2. Using Selenium, create a driver object
  3. Load the page using the get method of the driver object
  4. Make a BS4 object, passing the HTML source (the driver.page_source attribute) and the name of Python's built-in HTML parser, html.parser, as arguments
  5. Use the find_all method to extract the data in the tags you identified into a list
  6. Clean the data extracted by using .text and .strip()
  7. Append the data to a list for comparison
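
Because the steps are the same for both retailers, they could also be folded into one reusable function. The following is a sketch under stated assumptions, not the tutorial's own code: extract_prices and scrape_prices are hypothetical names, the Selenium call uses the Selenium 4 Service style, and the sample markup at the bottom is invented so that the parsing half can be tried without a browser.

```python
from bs4 import BeautifulSoup


def extract_prices(page_source, tag, attrs):
    """Steps 4-7: parse the HTML, find the matching tags, clean and collect prices."""
    soup = BeautifulSoup(page_source, "html.parser")
    prices = []
    for item in soup.find_all(tag, attrs):
        # Remove surrounding whitespace, the dollar sign, and thousands separators.
        text = item.text.strip().lstrip("$").replace(",", "")
        prices.append(float(text))
    return prices


def scrape_prices(url, tag, attrs, driver_path="/usercode/chromedriver"):
    """Steps 1-3: start Chrome, load the page, then hand the source to extract_prices."""
    # Imported here so extract_prices can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(driver_path))
    try:
        driver.get(url)
        return extract_prices(driver.page_source, tag, attrs)
    finally:
        driver.quit()  # always close the browser, even if parsing fails


# Offline check with an invented snippet that reuses the Madewell class name.
sample = '<span class="product-sales-price product-usd">\n$98.00\n</span>'
print(extract_prices(sample, "span", "product-sales-price product-usd"))
```

With this shape, scraping Madewell would be a call such as scrape_prices("https://www.madewell.com/womens/clothing/jeans", "span", "product-sales-price product-usd"), and NET-A-PORTER would pass {"itemprop": "price"} as the attrs argument instead.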

Wrapping up and next steps

Congratulations! You've built your first web scraper with Python. By now, you might have a better idea of just how useful web scraping can be, and we encourage you to keep learning more about Python if you want to develop the skills to create your own APIs.

You might not master Python in a single day, but hopefully, this tutorial has helped you realize that Python is much more approachable than you might expect.
To help you master Python, we've created the Predictive Data Analysis with Python course.

Happy learning!

Continue learning about Python on Educative

Start a discussion

Have you worked with web scraping in the past? Was this article helpful? Let us know in the comments below!
