The internet is arguably the most abundant data source that you can access today. Crawling through this massive web of information on your own would take a superhuman amount of effort. So, why not build a web scraper to do the detective work for you? Automated web scraping is a great way to collect relevant data across many webpages in a relatively short amount of time.
You may be wondering why we chose Python for this tutorial, and the short answer is that Python is considered one of the best programming languages to use for web scraping. Python libraries like BeautifulSoup and packages like Selenium have made it incredibly easy to get started with your own web scraping project.
We'll introduce you to some basic principles and applications of web scraping. Then, we'll take a closer look at some of the more popular Python tools and libraries used for web scraping before moving on to a quick step-by-step tutorial for building your very own web scraper.
Let's get started!
We'll cover:
- Overview: Web scraping with Python
- Build a web scraper with Python
- Wrapping up and next steps
Overview: Web scraping with Python
As a high-level, interpreted language, Python 3 is one of the easiest languages to read and write because its syntax bears some similarities to the English language. Luckily for us, Python is much easier to learn than English. Python programming is also a great choice in general for anyone who wants to dabble in data science, artificial intelligence, machine learning, web applications, image processing, or operating systems.
This section will cover what Python web scraping is, what it can be used for, how it works, and the tools you can use to scrape data.
What is web scraping?
Web scraping is the process of extracting usable data from different webpages for analysis, comparison, and many other purposes. The type of data that can be collected ranges from text and images to ratings and URLs. Web scrapers extract this data by loading a URL and reading the HTML code for that page. Advanced web scrapers are capable of extracting CSS and JavaScript code from the webpage as well.
Believe it or not, web scraping used to be conducted manually by copying and pasting data from webpages into text files and spreadsheets!
Legality
As long as the data you're scraping does not require an account for access, isn't blocked by a robots.txt file, and is publicly available, it's considered fair game.
What's the difference between a web crawler and a web scraper?
A web crawler just collects data (usually to archive or index), while web scrapers look for specific types of data to collect, analyze, and transform.
What can web scraping be used for?
Web scraping has a wide variety of applications. Web developers, digital marketers, data scientists, and journalists regularly use web scraping to collect publicly available data. This data can be transferred to a spreadsheet or JSON file for easy data analysis, or it can be used to create an application programming interface (API). Web scraping is also great for building bots, automating complicated searches, and tracking the prices of goods and services.
Here are some other real-world applications of web scraping:
- Natural language processing
- Machine learning
- Predictive analytics
- Real-time analytics
- Price monitoring
- Financial data aggregation
- Scraping song lyrics
- Tracking stock prices
- Fetching images and product descriptions
- Consumer sentiment analysis
- Search Engine Optimization (SEO) monitoring
How does web scraping work?
Web scraping involves three steps:
- Data collection: In this step, data is collected from webpages (typically with a web crawler)
- Data parsing and transformation: This next step involves transforming the collected dataset into a format that can be used for further analysis (like a spreadsheet or JSON file)
- Data storage: The last stage of web scraping involves storing the transformed data in a JSON, XML, or CSV file
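To make these three steps concrete, here's a minimal sketch that runs offline: the HTML snippet is hardcoded (standing in for a page you'd fetch in step one), and the CSV output goes to an in-memory buffer rather than a file. The markup and prices are made up for illustration.

```python
import csv
import io
import re

# Step 1, data collection: in a real scraper this HTML would come from a
# live page; here we hardcode a snippet so the example runs offline.
html = """
<ul>
  <li><span class="price">$19.99</span> Jeans A</li>
  <li><span class="price">$24.50</span> Jeans B</li>
</ul>
"""

# Step 2, data parsing and transformation: pull the price strings out of
# the markup and convert them to floats.
prices = [float(p) for p in re.findall(r'\$(\d+\.\d+)', html)]

# Step 3, data storage: write the transformed data as CSV (an in-memory
# buffer here, but a file path works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["price"])
writer.writerows([[p] for p in prices])
print(prices)  # [19.99, 24.5]
```

A regular expression is enough for this toy snippet; for real pages, an HTML parser like Beautiful Soup is far more robust.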
What tools and libraries are used to scrape the web?
These are some of the most popular tools and libraries used to scrape the web using Python.
- Beautiful Soup: A popular Python library used to extract data from HTML and XML files
- MechanicalSoup: Another Python library used to automate interactions on websites (like submitting forms)
- Scrapy: A high-speed, open-source web crawling and scraping framework
- Selenium: A suite of open-source automation tools that provides an API to write acceptance or functional tests
- Python Requests: The requests library allows users to send HTTP/1.1 requests without needing to attach query strings to URLs or form-encode POST data
- LXML: A tool used for processing XML and HTML in the Python language
- Urllib: A package used for opening URLs
- Pandas: Not typically used for scraping, but useful for data analysis, manipulation, and storage
However, for the purposes of this tutorial, we'll be focusing on just three: Beautiful Soup 4 (BS4), Selenium, and the statistics.py module.
Build a web scraper with Python
Let's say we want to compare the prices of women's jeans on Madewell and NET-A-PORTER to see who has the better price.
For this tutorial, we'll build a web scraper to help us compare the average prices of products offered by two similar online fashion retailers.
Step 1: Select the URLs you want to scrape
For both Madewell and NET-A-PORTER, you'll want to grab the target URL from their webpage for women's jeans.
For Madewell, this URL is:
https://www.madewell.com/womens/clothing/jeans
For NET-A-PORTER, your URL will be:
https://www.net-a-porter.com/en-us/shop/clothing/jeans
Step 2: Find the HTML content that you want to scrape
Once you've selected your URLs, you'll want to figure out what HTML tags or attributes your desired data will be located under. For this step, you'll want to inspect the source of your webpage (or open the Developer Tools Panel).
You can do this with a right-click on the page you're on, and selecting Inspect from the drop-down menu.
Google Chrome shortcut: Ctrl + Shift + C for Windows or Command + Shift + C for macOS will let you view the HTML code for this step.
In this case, we're looking for the price of jeans. If you look through the HTML document, you'll notice that this information is available under the span tag for both Madewell and NET-A-PORTER.
However, using the span tag alone would retrieve too much irrelevant data because it's too generic. We want to narrow down our target when data scraping, and we can get more specific by using attributes inside the span tag instead.
For Madewell, a better target is the class attribute value:
product-sales-price product-usd
For NET-A-PORTER, we'd want to narrow down our target with the attribute:
itemprop
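To see why attribute filtering matters before wiring up Selenium, here's a standalone sketch using Python's built-in html.parser module. The markup fed to the parser is a hypothetical fragment, not real NET-A-PORTER HTML; the parser collects text only from span tags whose itemprop attribute is "price", skipping generic spans:

```python
from html.parser import HTMLParser

class PriceSpanParser(HTMLParser):
    """Collect text only from <span itemprop="price"> tags."""
    def __init__(self):
        super().__init__()
        self.in_price_span = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Only flag spans that carry the itemprop="price" attribute
        if tag == "span" and dict(attrs).get("itemprop") == "price":
            self.in_price_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price_span = False

    def handle_data(self, data):
        if self.in_price_span and data.strip():
            self.prices.append(data.strip())

# Hypothetical markup: one price span, one generic span we want to skip.
parser = PriceSpanParser()
parser.feed('<span itemprop="price">$128.00</span><span>Free shipping</span>')
print(parser.prices)  # ['$128.00']
```

Beautiful Soup does this same filtering in one line, which is why we'll reach for it in the tutorial itself.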
Step 3: Choose your tools and libraries
For this task, we will be using the Selenium and Beautiful Soup 4 (BS4) libraries in addition to the statistics.py module. Here's a quick breakdown of why we chose these web scraping tools:
Selenium
Selenium can automatically open a web browser and run tasks in it using a simple script. The Selenium library requires a browser-specific driver, so we decided to use Google Chrome and downloaded its driver from here: ChromeDriver Downloads
BS4
We're using BS4 with Python's built-in HTML parser because it's simple and beginner-friendly. A BS4 object gives us access to tools that can scrape any given website through its tags and attributes.
Scrapy is another Python library that would have been suitable for this task, but it's a little more complex than BS4.
statistics.py
The statistics.py module contains methods for calculating mathematical statistics of numeric data.

| Method | Description |
| --- | --- |
| statistics.mean() | Calculates the mean (average) of the given data |
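For example, statistics.mean() accepts any iterable of numbers. The price lists here are made up, but this is exactly how we'll average the scraped prices later:

```python
import statistics

# Hypothetical price lists standing in for scraped data from two shops
shop_a = [128.0, 98.0, 150.0]
shop_b = [89.5, 120.5]

print(statistics.mean(shop_a))  # 125.33333333333333
print(statistics.mean(shop_b))  # 105.0
```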
Step 4: Build your web scraper in Python
4a: Import your libraries
First, you'll want to import statistics, webdriver from selenium, and BeautifulSoup from the bs4 library.
from bs4 import BeautifulSoup
from selenium import webdriver
import statistics
4b: Create a Google Chrome driver object
PATH = r'/usercode/chromedriver'
driver = webdriver.Chrome(PATH)
4c: Fetch NET-A-PORTER's website data from their URL using the driver.get method
driver.get("https://www.net-a-porter.com/en-us/shop/clothing/jeans")
4d: Create a BeautifulSoup object
Here, we create a BeautifulSoup object with the HTML source as driver.page_source and Python's built-in HTML parser, html.parser, as arguments.
This starts the web scraper's search for specific tags and attributes.
soup = BeautifulSoup(driver.page_source, 'html.parser')
response = soup.find_all("span", {"itemprop" : "price"})
4e: Save the price data into a list, then print it
data = []
for item in response:
    data.append(float(item.text.strip("\n$")))
print(data)
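The strip-and-float idiom in that loop deserves a closer look: str.strip removes any of the listed characters from both ends of the string, so "\n$" handles leading newlines and dollar signs together. Note that it won't handle thousands separators like "1,299.00", which would need an extra replace(",", ""). A standalone sketch with made-up price strings:

```python
# Made-up scraped price strings, roughly as item.text might return them
raw = ["\n$128.00\n", "$98.50", "\n$150.00"]

# strip("\n$") removes newlines and dollar signs from both ends,
# then float() converts the remaining digits
data = [float(s.strip("\n$")) for s in raw]
print(data)  # [128.0, 98.5, 150.0]
```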
4f: Print the mean of your data from each website
Once both scraper functions have returned their price lists (extracted_data1 and extracted_data2 in the completed code below), print each list's mean:
print(statistics.mean(extracted_data1))
print(statistics.mean(extracted_data2))
Completed code
from bs4 import BeautifulSoup
from selenium import webdriver
import statistics

def shop1():
    PATH = r'/usercode/chromedriver'
    driver = webdriver.Chrome(PATH)
    driver.get("https://www.net-a-porter.com/en-us/shop/clothing/jeans")
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    response = soup.find_all("span", {"itemprop" : "price"})
    data = []
    for item in response:
        data.append(float(item.text.strip("\n$")))
    print(data)
    return data

def shop2():
    PATH = r'/usercode/chromedriver'
    driver = webdriver.Chrome(PATH)
    driver.get("https://www.madewell.com/womens/clothing/jeans")
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    response = soup.find_all("span", "product-sales-price product-usd")
    data = []
    for item in response:
        data.append(float(item.text.strip("\n$")))
    print(data)
    return data

extracted_data1 = shop1()
extracted_data2 = shop2()
print(statistics.mean(extracted_data1))
print(statistics.mean(extracted_data2))
Step 5: Repeat for Madewell
Using the above code, you can repeat the steps for Madewell. As a quick reminder, here are the basic steps you'll need to follow:
- Assign the webdriver file path to a path variable
- Using Selenium, create a driver object
- Load the page with the driver.get method of the driver object
- Make a BS4 object with the HTML source (driver.page_source) and Python's built-in HTML parser, html.parser, as arguments
- Use the find_all method to extract the data in the tags you identified into a list
- Clean the extracted data using .text and .strip()
- Append the data to a list for comparison
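Once both retailers follow the recipe above, the final comparison boils down to a small helper. This is a sketch under the assumption that each shop function returns a clean list of floats, as shop1 and shop2 do in the completed code; the function name and sample prices are hypothetical:

```python
import statistics

def cheaper_shop(name1, prices1, name2, prices2):
    """Return the name of the shop with the lower average price."""
    return name1 if statistics.mean(prices1) <= statistics.mean(prices2) else name2

# Hypothetical price lists standing in for the scraped data
print(cheaper_shop("NET-A-PORTER", [300.0, 450.0], "Madewell", [128.0, 98.0]))  # Madewell
```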
Wrapping up and next steps
Congratulations! You've built your first web scraper with Python. By now, you might have a better idea of just how useful web scraping can be, and we encourage you to keep learning more about Python if you want to develop the skills to create your own APIs.
You might not master Python in a single day, but hopefully, this tutorial has helped you realize that Python is much more approachable than you might expect.
To help you master Python, we've created the Predictive Data Analysis with Python course.
Happy learning!
Continue learning about Python on Educative
- A complete guide to web development in Python
- 50 Python interview questions and answers
- Level up your Python skills with these 6 challenges
Start a discussion
Have you worked with web scraping in the past? Was this article helpful? Let us know in the comments below!