
shellyalmo


How to build a Data Pipeline for the first time

I am fascinated by Machine Learning models and the incredible tools they offer our world, whether for making decisions, predicting trends, improving lives and even saving them. Those models are trained and tested on tremendous amounts of data that is constantly being collected and stored in databases.

I was really curious to learn how raw data gets into programs and what processes are being done to make this raw data useful, even before training a brilliant Machine Learning model.

One type of data that we use on a daily basis for predictions is weather observations. Obviously, weather forecasting is mostly based on complicated physics equations and statistical models. Still, I thought it would be fun to build my own weather database for practicing my Data Science skills.

Instead of connecting temperature sensors to my computer and opening my own meteorological station at home (which could be very cool for my next project), I decided to build a simple data pipeline to see how it's done.

My mission was to build a pipeline from scratch: retrieve weather data from the OpenWeatherMap current weather web API, parse it with the Pandas (Python data analysis) library, and store it in a local SQLite database.

Here's how I did it:

Step 1: Get an API key

OpenWeatherMap offers a few paid API plans; some are monthly subscriptions with powerful tools. I was happy to start with the free plan, which offers access to current weather data for any city.

Step 2: Keep your API key private

You wouldn't want a stranger to have your ATM PIN, right? API keys can likewise be used to steal from your wallet, especially if you are a paying customer of the service the API provides, or if you have a limited number of API calls. Environment variables come in handy here, since they are variables whose values are set outside the program (highly recommended tutorial here). By making my API key an environment variable, it stays outside of my program and can be used without being exposed in my public GitHub repository. I saved my API key into a new file called .env. This is what it looks like inside:

api-token = "typeyourapikeyhere"
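To read the key back inside the program, here is a minimal sketch, assuming the python-dotenv package (pip install python-dotenv) and that it accepts the api-token key name as written above:

# A sketch of loading the API key from .env with python-dotenv.
from dotenv import dotenv_values

config = dotenv_values(".env")   # parses .env into a plain dict
API_KEY = config["api-token"]    # the key never appears in the source code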

Now we can start coding our pipeline. Before coding, I took some time to design my program and eventually decided to separate it into files by responsibility.
My program design guidelines were:

  • Each function does one thing.
  • Each function is tested on its own, to make sure it actually does what it's supposed to do. This also saves a big headache when other bugs pile up and you have to figure out which function is the problematic one. Here I chose to use doctest for small, simple tests; a tiny example follows this list.
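A doctest sketch of the idea (the function name is just an illustration, not one from the repository):

# A tiny doctest example; the test lives in the docstring itself.
def kelvin_to_celsius(kelvin):
    """Convert a temperature from Kelvin to Celsius.

    >>> kelvin_to_celsius(273.15)
    0.0
    """
    return round(kelvin - 273.15, 2)

if __name__ == "__main__":
    import doctest
    doctest.testmod()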

For the next steps, make sure that:

  • Python 3 and pip are already installed.
  • If you are using the files on my GitHub repository, you can install the dependencies by running the following command:
pip install -r requirements.txt
  • In the folder where you'll run my program, you need a folder called data_cache. This is where all the data will be saved before storing it in the database.

Step 3: Create a new empty Database

In a Python file, I created a SQLite database with the sqlite3 library:
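A minimal sketch of this step, assuming a local file called weather.db and illustrative column names that match the parameters listed in step 6:

# A sketch of creating the SQLite database; the names are illustrative.
import sqlite3

def create_database(db_path="weather.db"):
    connection = sqlite3.connect(db_path)  # creates the file if it doesn't exist
    connection.execute(
        """CREATE TABLE IF NOT EXISTS weather (
               dt INTEGER PRIMARY KEY,
               temp REAL,
               temp_min REAL,
               temp_max REAL,
               humidity REAL,
               pressure REAL,
               wind_speed REAL
           )"""
    )
    connection.commit()
    connection.close()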

Step 4: Retrieve the data and save as a json file

At this point you will be able to get the data in JSON format and save it as a json file in the data_cache folder you created earlier. Each json file is named after the "dt" value, which stands for datetime. Note that the datetime format is a Unix epoch timestamp.
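A sketch of this step, assuming the requests library and the data_cache folder mentioned above (the function name and parameters are illustrative):

# A sketch of retrieving current weather and caching it as a json file.
import json
import requests

def fetch_current_weather(city_id, api_key):
    url = "https://api.openweathermap.org/data/2.5/weather"
    response = requests.get(url, params={"id": city_id, "appid": api_key})
    response.raise_for_status()
    data = response.json()
    # "dt" is the observation time as a Unix epoch timestamp
    with open(f"data_cache/{data['dt']}.json", "w") as file:
        json.dump(data, file)
    return data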

Step 5: From json file to dictionary to Pandas Dataframe

This might seem a bit Sisyphean, but I preferred to break down the process into as many "baby steps" as possible. To me, it's organized, clear, and helps keep track of each step.
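A sketch of these baby steps (pd.json_normalize flattens the nested response into columns such as main.temp and wind.speed; the function names are illustrative):

# A sketch of turning a cached json file into a one-row DataFrame.
import json
import pandas as pd

def json_file_to_dict(path):
    with open(path) as file:
        return json.load(file)

def dict_to_dataframe(record):
    # the nested OpenWeatherMap response becomes flat columns like "main.temp"
    return pd.json_normalize(record)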

Step 6: ETL with Pandas

The ETL procedure enables us to extract and transform the data according to our analysis needs, and then load it into our data warehouse. To make my data useful for future Data Science projects, I made sure my database would contain the parameters needed for daily temperature prediction (current temperature in Celsius, minimum and maximum temperatures in Celsius, humidity, pressure, and wind). I also chose the datetime ("dt") column as the row index, acting as the primary key for my database.
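A sketch of the transform step, assuming the flattened column names from step 5 (the real column selection may differ):

# A sketch of the transform: keep the prediction features, convert Kelvin to
# Celsius, and use "dt" as the index.
def transform(df):
    df = df[["dt", "main.temp", "main.temp_min", "main.temp_max",
             "main.humidity", "main.pressure", "wind.speed"]].copy()
    for column in ["main.temp", "main.temp_min", "main.temp_max"]:
        df[column] = df[column] - 273.15   # OpenWeatherMap returns Kelvin by default
    df = df.rename(columns={
        "main.temp": "temp", "main.temp_min": "temp_min", "main.temp_max": "temp_max",
        "main.humidity": "humidity", "main.pressure": "pressure",
        "wind.speed": "wind_speed",
    })
    return df.set_index("dt")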

Step 7: Update the Database

Now that we have the current weather data saved in a dataframe, we can easily load it into our database using the Pandas library.
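A sketch of the load step with pandas' to_sql, assuming the weather table from step 3 and the transformed DataFrame from step 6:

# A sketch of appending the transformed rows to the SQLite table.
import sqlite3

def load(df, db_path="weather.db"):
    with sqlite3.connect(db_path) as connection:
        df.to_sql("weather", connection, if_exists="append",
                  index=True, index_label="dt")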

For flexibility, I used the argparse library, which lets you run main.py from the command line and pass it a city id as an optional argument. So even though Tel Aviv is the default city, you can still run the program for any city in the world. For example, to get the weather data for Detroit, US:

python3 src/main.py --city_id "4990729"
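A sketch of the argument parsing (the Tel Aviv id shown as the default comes from OpenWeatherMap's city list and is worth double-checking):

# A sketch of the command-line interface with argparse.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Current weather data pipeline")
    parser.add_argument("--city_id", default="293397",  # Tel Aviv (verify against city.list.json)
                        help="OpenWeatherMap city id")
    return parser.parse_args()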

When running main.py, steps 3-7 are executed:
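A sketch of how main.py could tie the steps together, using the illustrative function names from the sketches above:

# A sketch of the pipeline flow in main.py.
def main():
    args = parse_args()
    create_database()                                    # step 3
    raw = fetch_current_weather(args.city_id, API_KEY)   # step 4
    df = dict_to_dataframe(raw)                          # step 5
    df = transform(df)                                   # step 6
    load(df)                                             # step 7

if __name__ == "__main__":
    main()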

Done!

And there you have it! A pipeline you can build on your own. Eventually, the program is meant to run on a schedule and build up a database over time for my next Data Science project. For now, Windows Task Scheduler is a great way to start, but I recommend checking out Python Scheduler as well. There are some wonderful tutorials out there, just waiting to be explored.

Edit 10/18/20:

  • The program now runs on a schedule, every 15 minutes by default, using a while loop, and you can set the frequency to whatever you like (see the sketch after this list). For example, to run it every 5 seconds, use the command:
python3 src/main.py --frequency 5
  • All the Python files are currently stored in a local folder called src. In the next post, I will explain how to create a Docker image that copies only the Python files and dependencies necessary for this project.
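A sketch of the while-loop schedule described above, assuming --frequency is given in seconds (to match the 5-second example):

# A sketch of the scheduling loop around the pipeline.
import time

def run_on_schedule(frequency_seconds=15 * 60):   # default: every 15 minutes
    while True:
        main()
        time.sleep(frequency_seconds)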

Top comments (10)

Waylon Walker

This might seem a bit Sisyphean, but I preferred to break down the process into as many "baby steps" as possible. To me, it's organized, clear, and helps keep track of each step

Love this idea. I often make my pipeline nodes from a single line as well! It's more about each node doing a single operation that is understandable. I like that I can go to your start function and read through exactly what the pipeline does.

This is not too far off from how I would run projects before moving to kedro. Not that it's much different with kedro, just a bit more is taken care of for me, which is really nice as these things grow in run time and number of nodes. Going all in on a 40-hour pipeline only to see errors can be a nightmare to debug without being able to see what happened.

Rafael Ferreira

Excellent article, thanks

shellyalmo

Thank you very much!

Sergey

Hello! Thank you. I will be glad to read your articles about python.

shellyalmo

Thank you Sergey! I will keep writing more tutorials like this for sure. Hopefully I will include Docker and python scheduler next time.

Konstructa

Really good for beginning Python users. Thank you

shellyalmo

Glad to hear! Thank you

Eric Fleming

Great article! Those scheduler links at the bottom were also super helpful.

shellyalmo

Thank you Eric! I'm glad it helped!

Skyvia

Great example of building a data pipeline from scratch. Here are some useful tips that you can also consider - skyvia.com/blog/what-is-data-pipeline