Data engineering refers to the development of software that performs three tasks:
- Extract raw data from various sources.
- Transform and clean up the data.
- Load the data into a data warehouse, i.e., a database management system like, Google BigQuery or Amazon Redshift. Alternatively,load the data into MySQL, a relational database.
The process is complex involving numerous software tools and advanced technical expertise. For one thing is can be challenging and overwhelming to identify the starting point and resources from which to acquire the basic knowledge. No wonder the demand for data engineers in the technology sector has risen by an order of magnitude in the past years. As the need for data generation grows, the thirst for data engineers will continue to intensify.
This post, part 1 of a three-part tutorial shows you how to build a simple ETL[extract,transform,and load] pipeline with CSV files in Python, describes the extraction steps.
You will need the following software:
glob, a package that is preinstalled that searches for files in relative paths.
For details, see this section for documentation.
pandas, a package used for data manipulation. To install, type on the command line,
pip install pandas.
For details, see the pandas site.
virtualenvwrapper, a package that sets up virtual environments for Python. For details, see the related documentation.
This Kaggle dataset for the CSV data.
Here's an example of CSV data on car sales:
Create a project called
Create a virtual environment with the command line
Extract the zip file and move the CSV files for
etl_car_salesdirectory, like this:
- Create an
extract_from_csvfunction to read the CSV file in your main directory and then return a Dataframe with the
- Create an
extractfunction to read the data from the extracted csv file and append the data to the Dataframe that is being return the
Tip: To rename the columns, do the following:
- Print out the
extracted_datavariable and the result would read like this:
Part 2 will describe how to transform data in the ETL pipeline. Stay Tuned.