INTRODUCTION TO WEB SCRAPING WITH PYTHON: A four-week beginner-to-advanced-level course in web scraping using the Python programming language.
“If programming is magic, then web scraping is surely a form of wizardry.”
Ryan Mitchell, Web Scraping with Python.
To those who have not developed the skill, computer programming can seem like a kind of magic. If programming is magic, web scraping is wizardry: the application of magic for particularly impressive and useful – yet surprisingly effortless – feats.
What is Web Scraping?
In the past twenty years, there has been an explosion in the volume and availability of data, matched only by the growth of the Internet itself. Data scientists estimate that more data has been generated and stored in the last three years than in all the rest of history. In fact, in 2013 IBM (a company that made some of the earliest personal computers) estimated that 90% of the world's data had been generated within the previous two years. That gives a striking sense of just how steep the growth curve of data on the Internet really is.
This vast amount of data is what makes Internet applications like payment gateways and social media possible, and those applications power most emerging businesses (i.e. organisations formed within the last thirty years) and even many established ones (businesses founded more than thirty years ago). Really, it can be said that data powers the web and drives the Internet.
Where does all the data go?
Data that has been collected has to be stored somewhere so that it can be accessed again: retrieved, compared, updated, manipulated and shared. For example, an automobile company might collect personal data about its customers to generate mailing lists that will be used to share information about its future self-driving cars. This data can come from a variety of sources: an HTML form on the company's official website, detachable cards on its brochures, forms that customers fill in during purchases, or even data from advertising agencies!
This data needs to be stored somewhere so that it can be retrieved at a later date and used or updated.
Most data collected over the web (the web is short for World Wide Web, the vast collection of interlinked documents and services that runs on the Internet) is stored on special computers called servers, and servers aren't difficult to understand. Basically, all a server does is wait until a request is made for a file, locate that file (if it exists), and send it to whoever made the request.
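In fact, Python ships with a toy server, so you can watch this wait-locate-send cycle on your own machine. This is just an illustrative sketch (it serves the files in whatever folder you run it from, at a port I've picked arbitrarily):

from http.server import HTTPServer, SimpleHTTPRequestHandler

# serve files from the current folder; visit http://localhost:8000 in a browser
server = HTTPServer(("localhost", 8000), SimpleHTTPRequestHandler)
print("Waiting for requests at http://localhost:8000 ...")
server.serve_forever()  # wait for requests forever (press CTRL + C to stop)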
At the heart of data handling is the need to retrieve data. This is where web scrapers come in.
So, what are web scrapers?
A web scraper is a program that is designed to load a web document (remember that data is stored on servers in the form of documents; when these documents are accessed over the web, they are called web documents), parse it (a fancy word that means "to analyze and understand its structure"), and extract the needed information from it. Before the end of this article, you will have written several web scrapers.
Why learn to scrape the web?
There are numerous ways to access data over the web and, if we’re being fair, web scraping is only one of them. Data can be retrieved from APIs (Application Programming Interfaces, which I’ll write about soon), manipulated with SDKs (Software Development Kits), visualized with macros, and generally accessed (and assessed) in at least ten different ways.
So why web scrapers?
Firstly, all of the tools listed above (APIs, SDKs and macros) are higher-level programs that require special programming skills to develop properly, and they may not always be available. For example, a college student who starts a free WordPress blog probably doesn't have the finances to hire a software developer to build an API for her audience. This is where web scrapers step in: they fill the gap that so often exists between where data is stored and the people who need to access it. Generally, almost any data or document that can be viewed in a browser can be retrieved with a scraper. In fact, web scrapers can be designed to log in to your social media page and retrieve data (like how many likes you have or who just DM-ed you), even while you are asleep! Really advanced web scrapers are used by social media agencies to keep track of their clients' accounts and automate actions like liking, commenting and replying to messages.
Industries where web scraping knowledge can be applied:
What makes knowledge of web scraping such an in-demand skill is how quickly it can be used to solve a wide range of data problems in a wide range of industries, even in industries where software isn't the Number One tool.
Some industries that benefit from web scraping are:
- Data Science/Data Engineering: Web scraping is often used to mine very useful data that can be analyzed for patterns and used to build much larger applications or to solve problems.
- Information Security: With the vast amounts of information available over the web, and the attendant need to restrict access to private information from the public, Information Security is a fast-growing industry. It has its foundations in cyber security and data engineering, and web scraping is definitely an important skill in any company in this industry.
- Web Development: Designing and developing software for the web is known as web development. Websites, web apps and server-side web software are often built around a central store of data that can be retrieved from external sources using web scrapers. For example, quite recently, I had to develop a real estate listing web app for a client. I was given three websites to poll (that is, repeatedly query) and cull data from. The data I culled would be saved to my client's database and used to furnish his own website through an API. While this must have seemed a difficult task to my competitors (we had to pitch to be awarded the contract), I immediately saw the opportunity for three scrapers that would get the job done. In fact, using Selenium, one of my scrapers was able to automatically (without my help) sign in to a website, navigate to a listings page, enter the required text in the search field, extract the data and finally save it to the database! Pretty amazing work for a program, right?! (A sketch of this kind of Selenium automation appears after this list.)
- Advertising and Media: Web scrapers help advertising and media agencies obtain data that can be run through data analysis, from which important information can be extracted. Information obtained this way often informs the critical decisions that must be made in such companies.
- Security Firms, Militaries and Intelligence: By this point you can probably list a few applications of web scrapers in companies that specialize in personal and property security, in militaries, and in intelligence organisations.
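Here, as promised, is a rough sketch of that kind of Selenium automation. Every specific in it (the site, the field names, the CSS class) is hypothetical; a real scraper would use the selectors of the actual site it targets, and Selenium itself is installed with pip install selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()  # requires the Chrome browser

# hypothetical login page and form field names
driver.get("https://listings.example.com/login")
driver.find_element(By.NAME, "username").send_keys("my_user")
driver.find_element(By.NAME, "password").send_keys("my_pass", Keys.RETURN)

# navigate to the listings page and run a search
driver.get("https://listings.example.com/listings")
driver.find_element(By.ID, "search").send_keys("3-bedroom flat", Keys.RETURN)

# extract the text of each listing (a real scraper would save these to a database)
for listing in driver.find_elements(By.CSS_SELECTOR, ".listing"):
    print(listing.text)

driver.quit()  # close the browser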
Your first web scraper.
Now that you know so much about scrapers, what they are and why knowledge of how to write one is important, it is time to write your first scraper.
Step 1: Install Python and set up your programming environment.
The programming language we will be using to write our web scrapers is called Python. Python is one of the most popular and versatile programming languages in the world. It has been used to develop hundreds of thousands of computer applications, and today we will add yours to that figure!
To install Python on Windows or Mac:
- Go to https://www.python.org/downloads/
- Choose the right installation file for your computer (i.e. Windows or Mac). Be sure to download the latest version (at the time of writing this article, the latest version is 3.9.1).
- Twiddle your thumbs and congratulate yourself in advance for writing your first scraper while the installation file is being downloaded.
- Once the download is completed, double-click on it to run the installation. This shouldn’t take more than ten minutes.
- You are now ready to start programming in Python.
To install Python on Linux:
If you run Ubuntu or a similar Linux distro, you may well have Python preinstalled (run python3 --version in a terminal to check). If not, you can install the latest version from the Ubuntu Software Center (or the Synaptic Package Manager on some distros).
- In your app drawer, search for 'Software Center' or 'Synaptic' and run the app.
- In the search field, type ‘Python’ and press the Enter key. Your computer must be connected to the Internet.
- Click the Python software package and install it.
- After Python is installed, you should see the Python icon added to the desktop or the app drawer. Python is now installed and ready to be used.
- In the Ubuntu Software Center (or Synaptic), search for, and install, 'IDLE' (packaged as 'idle3' on most distros). Windows and Mac users will have IDLE installed automatically when they run their installation files.
- You are now ready to continue.
Step 2: Start the Python IDLE, click 'File' in the menu bar, select 'New File', and type in the following three lines:
from urllib.request import urlopen  # urlopen makes network requests for web documents

html = urlopen("https://google.com")  # request Google's homepage
print(html.read())  # read the response and display its raw contents
Save this file (CTRL + S, or CMD + S on Macs) as firstscraper.py (you can even build a collection of similar scripts by saving them all to one folder).
Step 3: In the menu tab (of the Python IDLE where you wrote your scraper), click 'Run', then 'Run Module'. Your computer must be connected to the Internet for this awesome step to work. Wait a few seconds (how long depends on your network speed) and... you should see a load of gibberish miraculously fill your screen.
Congratulations! You have successfully written your first scraper!!!
Now we will analyze our code and understand exactly what we have written and what all that gibberish means. Feel free to pause here, stretch, yawn, and sip some soda water because, soon, you will become a badass Python programmer!
Understanding network requests: what goes on under the hood of our scrapers?
At the heart of every scraper is the network request. An understanding of networks and network requests is important for understanding web scrapers. Without this knowledge, we will not be able to write really practical web scrapers or similar advanced software (like a social media bot). Fortunately, networks and network requests aren't difficult to understand at all.
A computer network is like a spider web: it is simply a connection between two or more computing devices. The Internet is the largest and most popular of them all. It is an enormous connection of millions (maybe billions) of devices that range in size and purpose from tiny, invisible-to-the-naked-eye nano-computers used in medical research, to IoT (IoT stands for Internet of Things. Go figure!) devices like the special computers that open the doors at the supermarket when a person steps near them, to more common devices like smartphones and personal computers. When you ran (that is, executed) your scraper, your program became a part of this global, massive network for a few seconds! Impressive, isn't it?
Other examples of computer networks are Wi-Fi networks, paired Bluetooth devices, and the infrared connection between your TV and its remote.
So, what is a network request?
Given that data is stored on servers and accessed as web documents, there is only one small piece left to understand about the data-retrieval process: network requests.
A network request is simply a request that you, or the programs you write, make to a server for a web document it stores.
The second line of our scraper script:
html = urlopen("https://google.com")
makes a network request to Google's homepage and stores the response in the 'html' variable (a variable is a named storage location in the computer's memory; you can find out more about variables on the Internet).
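Incidentally, the response object that urlopen returns carries more than just the document; it also records how the request went. A small sketch (the exact values you see will vary):

from urllib.request import urlopen

html = urlopen("https://google.com")
print(html.status)  # the HTTP status code, e.g. 200 for a successful request
print(html.getheader("Content-Type"))  # the document's format, e.g. text/html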
Anatomy of a network request
- A computer (or browser or program) called the 'user agent' makes a request to the server for a document.
- The server checks to see if it has the document in storage.
- If it does, it returns the document to the user agent. This is called a network response. (So a network response is the server's reply to a network request.)
- If the file does not exist on the server, a '404 HTTP status code' is returned as the network response. We will learn more about status codes in general, and the 404 code in particular, in this series.
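In Python, a missing document surfaces as an exception that your scraper can catch. A minimal sketch (the page in the URL is hypothetical and presumably does not exist):

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("https://google.com/no-such-page")  # hypothetical missing page
except HTTPError as error:
    print(error.code)  # e.g. 404: the server has no such document
else:
    print(html.status)  # e.g. 200: the document was found and returned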
Understanding the rest of our scraper
We have seen the meaning of our second line in the context of network requests and network responses. The first and third lines are even easier to understand, and are integral to almost every scraper that we will write:
from urllib.request import urlopen
This line simply imports the urlopen function from the request module of the urllib package. This might not make complete sense now, and if it doesn't, all you need to do is take a course in foundational Python programming. Some really amazing courses can be found online on Udemy, W3Schools and YouTube. You can also read some textbooks or join a Python programming bootcamp (my personal recommendation).
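If it helps, the import line just saves typing. These two forms do the same thing:

import urllib.request
html = urllib.request.urlopen("https://google.com")  # the full path, every time

from urllib.request import urlopen
html = urlopen("https://google.com")  # the shorter form our scraper uses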
print(html.read())
The third line merely 'reads' the network response (Google's homepage, which the server sent as an HTML document) and 'prints' it by displaying its contents.
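By the way, the "gibberish" you saw is the raw bytes of Google's HTML (notice the b' at the start of the output). If you would rather see readable text, you can decode the bytes into a string. A small sketch, assuming the page is UTF-8 encoded (most modern pages are):

from urllib.request import urlopen

html = urlopen("https://google.com")
page_text = html.read().decode("utf-8", errors="replace")  # bytes to string
print(page_text[:500])  # show just the first 500 characters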
That's all there is to our first web scraper. Next, we will introduce the amazing BeautifulSoup library and begin to improve our web scraping abilities. Do feel free to take five, stretch your muscles, eat some cereal and generally congratulate yourself on a job well done!
Introducing BeautifulSoup
Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!
The BeautifulSoup library was named after this poem, sung by the Mock Turtle in Lewis Carroll's Alice's Adventures in Wonderland.
True to its name, the BeautifulSoup library converts a mish-mash of ingredients (web documents in inconsistent file formats and encodings, from all sorts of sources) into something edible, even appreciable. Yum yum!
Installing BeautifulSoup
The BeautifulSoup library is not packaged with Python by default, unlike the urllib library, so it must be installed before it can be imported and used to make awesome scrapers. On most computers, we will first need pip, Python's package installer, which we will then use to install that most interesting of Soups.
For Windows systems
Python 3.4 and later usually install pip by default. You might already have pip installed, especially if you followed the Python installation steps at the beginning of this article.
- First, check whether you already have pip installed. Open Command Prompt and enter:
pip --help
If the output begins with "Usage: pip [options]", you have pip installed.
If not:
- Download the installer script from https://bootstrap.pypa.io/get-pip.py and save it as get-pip.py.
- In Command Prompt, navigate to the folder where you saved it and enter:
python get-pip.py
- This should install pip for you.
For Macs
If pip isn't already installed (which you can verify by running:
pip --help
in the Terminal), running:
sudo easy_install pip
should get it done. (If easy_install is missing on your Mac, python3 -m ensurepip --upgrade does the same job.)
For Linux
Nothing for you here. Skip to the next section.
Finally, to install BeautifulSoup
On Windows and Macs
In the Terminal/Command Prompt, run:
pip install beautifulsoup4
This will take a few seconds to download and install BeautifulSoup.
On Linux
In the Terminal, run:
sudo apt-get install python3-bs4
(The python3- prefix matches the Python 3 we installed earlier; python-bs4 is the older Python 2 package.)
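Whichever system you are on, you can confirm that the installation worked by running this line in IDLE; if it produces no error, BeautifulSoup is ready:

from bs4 import BeautifulSoup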
There. You are now ready to explore BeautifulSoup and become a master web scraper. Wait. That came out wrong. Web scrapist? If the software that scrapes the web is called a 'web scraper', what is the person who writes the web scraper called?
(It is highly likely that you will run into problems installing pip or BeautifulSoup. If errors occur or issues arise, please don't hesitate to reach out to me via email @ rhemafortune@gmail.com).
A More Advanced Web Scraper Using BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup  # bs4 is the package name that BeautifulSoup installs under

html = urlopen("http://pythonscraping.com/pages/page1.html")  # request the page
bsoup = BeautifulSoup(html.read(), 'html.parser')  # parse the response into a BeautifulSoup object
print(bsoup.h1)  # print the first h1 tag in the document
Write this in a new file (remember: Click 'File' > 'New File') and save it. Then run it by clicking 'Run' > 'Run Module'. After a few moments (during which the program will be making a network request for the document) you should see output similar to:
<h1>An Interesting Title</h1>
What this advanced scraper does is make a network request, save the network response (to the html variable), "parse" it into an object format (a form Python can really work with), and print the first h1 tag it finds.
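Incidentally, bsoup.h1 is a convenient shortcut. Either of these lines would print the same tag (find is a BeautifulSoup method we will meet properly in the next article):

print(bsoup.html.body.h1)  # walk down the document tree explicitly
print(bsoup.find('h1'))  # search the whole document for the first h1 tag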
You can view the entire web document in the network response by modifying your code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/pages/page1.html")
bsoup = BeautifulSoup(html.read(), 'html.parser')
print(bsoup)
Notice the absence of h1 in the print statement. This will print the entire BeautifulSoup object (stored in bsoup) and you should see output similar to:
<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing
elit, sed do eiusmod tempor incididunt ut labore et
dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.
</div>
</body>
</html>
And changing the last line to print(bsoup.div) should print:
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing
elit, sed do eiusmod tempor incididunt ut labore et
dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.
</div>
The BeautifulSoup constructor, which creates the BeautifulSoup object bsoup, takes two "arguments" in order to cook up our steamy soup. They are:
- A document to be parsed.
- A parser that should be used to parse the document.
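To make those two arguments concrete, here is the call from our scraper again, annotated (the 'lxml' parser mentioned in the comment is a faster alternative that must be installed separately with pip install lxml):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page1.html")
# argument 1: the document to parse; argument 2: the parser to use
# ('html.parser' is built into Python; alternatives like 'lxml' need installing)
bsoup = BeautifulSoup(html.read(), 'html.parser')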
Your choice of parser will always depend on the format of the web document you want to scrape. For example, for scraping HTML web documents, the html.parser parser is used. In fact, the vast majority of the scrapers you write will use this parser, since HTML is the most ubiquitous document format on the web.

You can play around with your scraper: make requests to different URLs (Uniform Resource Locator, aka website address), and print your whole documents or just small sections of them. In the next article we will more fully explore the BeautifulSoup library and write some really practical web scrapers that extract practical information (like apartments available for sale in Lagos, Nigeria). We will learn the full capabilities of BeautifulSoup, its methods and arguments, and the correct way to handle the network errors that might arise while our script is trying to extract information. Finally, I will leave several juicy exercises that you can attempt to strengthen your knowledge of the wizardry that is web scraping.
Cheers!