DEV Community

loading...

The pursuit of happiness with Python and PostgreSQL

Francesco Tisiot
Francesco is a Developer Advocate at Aiven. Long time data practitioner and analytics enthusiast!
Originally published at aiven.io on ・8 min read

The world is seldom perfect, and the same goes for real-life data. Read on to find out how to clean up the data in your PostgreSQL database so that it's usable for analysis.

Hero Image Elephant

Creating information is like any creative process. It starts with finding the materials (in this case, datasources) and continues with cleaning, joining and wrangling datasets. By the time you finish you've built something beautiful and gained new insights to share - and then you start all over again.

One of the most widely-used languages in the creation of information is Python, loved by data scientists, engineers and analysts for its great ecosystem of existing libraries for data wrangling.

This blog post explores three of such libraries: pandas, ddlgenerator and psycopg2. These three enable us to clean a dataset and push it to a PostgreSQL database where the data can later be queried and exposed to a huge variety of company figures.

Today, the objective is simple: we will explore a happiness dataset and try to find out where in the world we should move to have joyful life!

0. Get the dataset

As mentioned before, begin by finding a dataset. Kaggle is a website widely used in the data science community, providing datasets used for challenges, competitions or learning. They have this nice dataset about World Happiness which only requires a login to be downloaded. It contains 5 CSV files, one per year, listing the happiness ranking of various countries together with some other indicators.

Let's create a folder named happiness and a subfolder named data for storing the file files:

mkdir -p happiness/data
cd happiness
Enter fullscreen mode Exit fullscreen mode

Next, download the data from Kaggle and place the 5 CSVs in the happiness/data folder.

1. Clean the data

As it happens frequently, the source datasets are not always tidy and clean. This is valid for our happiness dataset too.
When browsing the files we can note that 2015.csv and 2016.csv share the same format of column names and data. The 2017.csv has the Region column missing, some others named and ordered differently (Generosity and Trust are swapped). The same applies to the files 2018.csv and 2019.csv where we can't find the confidence intervals anymore. Moreover, the year is only contained in the file name, and not in a column.

Before starting our cleaning efforts, it's a good practice to create a separate virtual environment every time we start a new Python project. This ensures our default Python environment is not overloaded with possibly conflicting libraries. We can create a virtual environment and enable it like this:

python -m venv python_pg
source python_pg/bin/activate
Enter fullscreen mode Exit fullscreen mode

The files need some work in order to produce a consistent view of the data, so let's put on our Python superhero mask. Start by installing the pandas library, which will help us with data manipulation and analysis, by executing the following code in your terminal:

pip install pandas
Enter fullscreen mode Exit fullscreen mode

Now we can create a file named prepare.py in our happiness folder and start including some Python code in it. We'll start by importing the library installed earlier, together with os and glob which should be installed by default and enable the interaction with the operative system and pathname pattern expansion respectively.

import pandas as pd
import glob, os
Enter fullscreen mode Exit fullscreen mode

Next, define the column names that we're going to use in our target DataFrame named all_years_df to store the dataset once cleaned. Add the following lines to the prepare.py file:

column_names = ['survey_yr', 'country', 'overall_rank',
                'score', 'gdp', 'generosity', 'freedom',
                'social_support', 'life_exp', 'gov_trust']

all_years_df = pd.DataFrame()
Enter fullscreen mode Exit fullscreen mode

Now it's time to bring order to chaos and parse all the CSV files in the data folder. Add the following code to the prepare.py file:

files = glob.glob('data/*.csv')

for fp in files:
    # Handling file names as new column
    file_year = os.path.basename(fp).split('.')[0]
    file_content = pd.read_csv(fp).assign(Year=int(file_year)-2000)

    # Picking the right column for each file based on the year
    uniformed_columns = find_columns(file_content, file_year)

    # Conforming column names and appending the parsed year data to all_years_df
    uniformed_columns.columns = column_names
    all_years_df = pd.concat([all_years_df, uniformed_columns])

all_years_df.to_csv('results.csv', index=False)
Enter fullscreen mode Exit fullscreen mode

The glob function returns the list of files in the happiness/data folder, that we loop over. For each file, we read the filename with os.path.basename, split the name on the dot (.) and take the first part, extracting only the year information. The file_content line uses pandas' read_csv function to load the file data and populates the new column Year (using the assign function) with the last two digits of the year (int(file_year)-2000).

Next we call the find_columns function, which we'll define soon, to select the correct columns from the source file depending on the year.

After that, we consolidate the column names in the uniformed_columns DataFrame containing only one year of data, and then append it to the all_years_df DataFrame that will contain the whole dataset at the end of the process. Lastly we store the end result in a CSV file named results.csv in the happiness folder.

Picking the correct columns

The missing part is the find_columns function definition where we consolidate the column names and order. As mentioned above, depending on the year, we need to select the correct columns from the source file. Add the following code to the prepare.py file, just after the import statements, to perform the trick on various years:

def find_columns(file_content, file_year):
    if file_year in ['2015', '2016']:
        uniformed_columns = file_content[[
            'Year', 'Country', 'Happiness Rank', 'Happiness Score',
            'Economy (GDP per Capita)', 'Generosity', 'Freedom', 'Family',
            'Health (Life Expectancy)', 'Trust (Government Corruption)'
            ]]

    elif file_year in ['2017']:
        uniformed_columns = file_content[[
            'Year', 'Country', 'Happiness.Rank', 'Happiness.Score',
            'Economy..GDP.per.Capita.', 'Generosity', 'Freedom', 'Family',
            'Health..Life.Expectancy.', 'Trust..Government.Corruption.'
            ]]

    else:
        uniformed_columns = file_content[[
            'Year', 'Country or region', 'Overall rank', 'Score',
            'GDP per capita', 'Generosity', 'Freedom to make life choices',
            'Social support', 'Healthy life expectancy',
            'Perceptions of corruption'
            ]]

    return uniformed_columns
Enter fullscreen mode Exit fullscreen mode

The function picks the correct column for each year and returns the resulting DataFrame.

It's now time to check that the pipeline is working. With the terminal positioned on the happiness folder let's execute the following:

python prepare.py
Enter fullscreen mode Exit fullscreen mode

Now check the file results.csv in the happiness folder to verify that the code worked. The results should look like this:

All Years Data

2. Create a PostgreSQL instance

The second step is to store our data in a PostgreSQL table. But before storing, we need a PG Database, let's create one with Aiven CLI in our terminal

avn service create demo-pg    \
  -t pg                       \
  --cloud google-europe-west3 \
  -p hobbyist
Enter fullscreen mode Exit fullscreen mode

The above command creates a PostgreSQL instance named demo-pg on google-europe-west3 using a hobbyist plan. You can review all PostgreSQL plans in our pricing page. To connect to it, we'll need the service uri containing the information like host, port, and the default avnadmin user's password. We can retrieve it with:

avn service get demo-pg --format '{service_uri}'
Enter fullscreen mode Exit fullscreen mode

Now a couple of minutes of relax, while waiting for the instance to be ready with the following call in the terminal

avn service wait demo-pg
Enter fullscreen mode Exit fullscreen mode

3. Push the data to PostgreSQL

Once the service is running, go back to Python and prepare the data push.

We'll use the ddlgenerator library to automatically generate the SQL statement to create the PostgreSQL table, and psycopg2 to interact with the database. We can install both with the following line in the terminal:

pip install psycopg2 ddlgenerator
Enter fullscreen mode Exit fullscreen mode

Now create a new push.py file and add the import and connect to the database statements. Replace in the following code the <SERVICE_URI_OUTPUT> parameter with the output of the avn service get command above:

import psycopg2

conn = psycopg2.connect('<SERVICE_URI_OUTPUT>')
Enter fullscreen mode Exit fullscreen mode

The connection is established. We can now create the table and push the data. ddlgenerator provides a nice functionality to create both table DDLs and insert statements from a CSV file. We can append the following code to the push.py file:

from ddlgenerator.ddlgenerator import Table

table = Table('results.csv', table_name='HAPPINESS')
sql = table.sql('postgresql', inserts=True)
Enter fullscreen mode Exit fullscreen mode

After importing the Table class from ddlgenerator, the code creates an instance parsing the results.csv file, with a target table name of HAPPINESS. The last line creates the CREATE table definition SQL for PostgreSQL, including the insert statements for each line (inserts=True). If we print the SQL statement it should look like the following:

DROP TABLE IF EXISTS happiness;

CREATE TABLE happiness (
    survey_yr INTEGER NOT NULL,
    country VARCHAR(24) NOT NULL,
    overall_rank INTEGER NOT NULL,
    score DECIMAL(15, 14) NOT NULL,
    gdp DECIMAL(17, 16) NOT NULL,
    generosity DECIMAL(16, 16) NOT NULL,
    freedom DECIMAL(16, 16) NOT NULL,
    social_support DECIMAL(16, 15) NOT NULL,
    life_exp DECIMAL(17, 16) NOT NULL,
    gov_trust DECIMAL(16, 16)
);

INSERT INTO happiness (survey_yr, country, overall_rank, score, gdp, generosity, freedom, social_support, life_exp, gov_trust) VALUES (19, 'Finland', 1, 7.769, 1.34, 0.153, 0.596, 1.587, 0.986, 0.393);
INSERT INTO happiness (survey_yr, country, overall_rank, score, gdp, generosity, freedom, social_support, life_exp, gov_trust) VALUES (19, 'Denmark', 2, 7.6, 1.383, 0.252, 0.592, 1.573, 0.996, 0.41);
...
Enter fullscreen mode Exit fullscreen mode

We can now execute the sql statement against the database by adding the following code to the push.py file

cur = conn.cursor()
cur.execute(sql)
conn.commit()
cur.close()
conn.close()
Enter fullscreen mode Exit fullscreen mode

We created a cursor, executed the sql statement, committed the change to the database and closed both cursor and connection. It's now time to run the code. With the same terminal on the happiness folder, run the following:

python push.py
Enter fullscreen mode Exit fullscreen mode

Congrats, the data is now loaded in PostgreSQL.

Query the Data in PostgreSQL

Let's verify that the data is correctly populated in the PostgreSQL happiness table. Create a new Python file query.py with the following content (as before, amend the connection service uri):

import psycopg2

conn = psycopg2.connect('<SERVICE_URI_OUTPUT>')

cur = conn.cursor()
cur.execute('''
SELECT SURVEY_YR, COUNTRY, OVERALL_RANK
from HAPPINESS
WHERE OVERALL_RANK <= 3
ORDER BY SURVEY_YR, OVERALL_RANK
''')

print(cur.fetchall())
cur.close()
conn.close()
Enter fullscreen mode Exit fullscreen mode

As per the push.py file, we are creating a connection to PostgreSQL, then executing a query to check which countries were in the top three position over the various years. The result should be similar to:

[
(15, 'Switzerland', 1),   (15, 'Iceland', 2),       (15, 'Denmark', 3),
(16, 'Denmark', 1),       (16, 'Switzerland', 2),   (16, 'Iceland', 3),
(17, 'Norway', 1),        (17, 'Denmark', 2),       (17, 'Iceland', 3),
(18, 'Finland', 1),       (18, 'Norway', 2),        (18, 'Denmark', 3),
(19, 'Finland', 1),       (19, 'Denmark', 2),       (19, 'Norway', 3)
]
Enter fullscreen mode Exit fullscreen mode

Looks like Northern European countries are a great place to live! Take particular notice of how Aiven's home country, Finland, has been on top in the last two years.

The data, now available in a PostgreSQL relational table, can be queried and exposed by all major analytics and reporting tools, making it accessible by a wider audience.

Wrapping up

Finding the right dataset is just the beginning: cleaning, wrangling and publishing the data creates information, which helps people in their decision-making. Python is a great fit in this area and widely adopted in the data community. The pandas, ddlgenerator and psycopg2 libraries mentioned in the blog enable an easy data handling and push to PostgreSQL. This makes for a solid base for data discovery or corporate reporting.

Some more info in this area:

Discussion (0)