
Aviator

Posted on • Originally published at Medium

Data extraction with Python and BeautifulSoup

An important prerequisite for making data-driven decisions is the acquisition of data.

Data extraction, also known as data gathering, is the process of acquiring the data that analysis depends on. Since not all data is relevant, it is important to be selective and gather only the data that will actually be useful for analysis.

The primary objective of this article is to offer a guide on how to extract data from websites. Not all websites provide a public API for data retrieval, so in the absence of one, web scraping becomes a practical way to acquire a website's data.

Installing Necessary Libraries

Before you begin, ensure that you have Python installed on your computer. It's recommended to have Python version 3.7 or newer.

First, create a folder for your project. Then, open the command prompt and navigate to your created folder.

Once you're in the correct folder using the command prompt, enter the following command:

```bash
pip install beautifulsoup4
pip install requests
pip install jupyter
```

In this tutorial, we will scrape a website for a list of all states in the USA, along with their abbreviated state names.

You can find the website we'll be working with at the following URL: USA_STATES_AND_ABBR

After successfully installing the required libraries, open the command prompt and enter the following command:

```bash
jupyter notebook
```

This will launch a Jupyter Notebook instance in your browser.

Import libraries and make a request to the website

To obtain your browser's user-agent, simply enter "my user agent" (without quotes) in your browser's search engine.

Now, let's proceed with the code. First, import the necessary libraries. Then set the headers and the URL for the website you want to scrape, and make a request to the website using the requests library. The response from the website will be stored in a variable named "response".
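A minimal sketch of this step, with placeholder values for the URL (the real one is behind the USA_STATES_AND_ABBR link above) and the user-agent header:

```python
import requests

# Both values below are placeholders -- substitute the URL behind the
# USA_STATES_AND_ABBR link and your own browser's user-agent string.
url = "https://example.com/usa-states-and-abbreviations"
headers = {"User-Agent": "Mozilla/5.0"}

# Make the HTTP request; the website's reply is stored in `response`
response = requests.get(url, headers=headers)
response.raise_for_status()  # raise an error early if the request failed
```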

Save the website's HTML contents to a file

Next, use Python's open() function to save the response from the website to a file named "usa_states.html". This avoids making repeated HTTP calls to the website while you work.
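Continuing from the response above, this step might look like:

```python
# Write the raw HTML to a local file so we don't have to hit
# the website again while experimenting
with open("usa_states.html", "w", encoding="utf-8") as f:
    f.write(response.text)
```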

Then, read from the saved file and use the BeautifulSoup library to parse the contents of the website into a Python object.
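For example:

```python
from bs4 import BeautifulSoup

# Read the saved HTML back in and parse it into a queryable Python object
with open("usa_states.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")
```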

Use the soup.find() method to search for the <table> element. Once found, search for all <tr> tags that are children of the table element.

On the website, each table row (<tr>) holds one state's data. The resulting list of rows is stored in the variable named states_table.
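A sketch of that step, assuming the states table is the first <table> on the page:

```python
# Find the first <table> element, then gather all of its <tr> rows
table = soup.find("table")
states_table = table.find_all("tr")
```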

Loop through the HTML table structure

To iterate through the states_table list, we can use a loop to find the <td> elements that are children of each <tr> element. When found, we append the result to the states_names list. To avoid duplicate entries when the cell is re-run, we first clear any previously saved content with the states_names.clear() method.
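Here's a sketch of the loop, assuming each row holds its values in <td> cells:

```python
states_names = []
states_names.clear()  # avoid duplicate rows if this notebook cell is re-run

for row in states_table:
    cells = row.find_all("td")
    if cells:  # skip rows without <td> cells, such as <th> header rows
        states_names.append([cell.get_text(strip=True) for cell in cells])
```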

Next, we can use Python's built-in csv module to write the result to a CSV file named "usa_states.csv".
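A sketch of the CSV step; the column names in the header row are an assumption about the table's layout:

```python
import csv

with open("usa_states.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["State", "Abbreviation"])  # assumed column names
    writer.writerows(states_names)
```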

Delete the HTML file

Finally, we can delete the "usa_states.html" file, which was used to store the website's HTML contents.
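For example, using the built-in os module:

```python
import os

# Remove the temporary HTML file now that the data lives in the CSV
os.remove("usa_states.html")
```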

Conclusion

Here, a simple method of scraping a website to extract data has been demonstrated. While web scraping can sometimes be challenging, I hope this article provides you with a basic understanding of how to scrape a website.
