<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: mirandaauhl</title>
    <description>The latest articles on DEV Community by mirandaauhl (@mirandaauhl).</description>
    <link>https://dev.to/mirandaauhl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F675083%2Fd31e976f-8654-4f52-8bf4-eee064fc0876.jpeg</url>
      <title>DEV Community: mirandaauhl</title>
      <link>https://dev.to/mirandaauhl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mirandaauhl"/>
    <language>en</language>
    <item>
      <title>PostgreSQL vs Python for data cleaning: A guide</title>
      <dc:creator>mirandaauhl</dc:creator>
      <pubDate>Wed, 08 Dec 2021 22:34:22 +0000</pubDate>
      <link>https://dev.to/tigerdata/postgresql-vs-python-for-data-cleaning-a-guide-3o5d</link>
      <guid>https://dev.to/tigerdata/postgresql-vs-python-for-data-cleaning-a-guide-3o5d</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;During analysis, you rarely - if ever - get to go directly from evaluating data to transforming and analyzing it. Sometimes, to properly evaluate your data, you may need to do some pre-cleaning before the main data cleaning - and that’s a lot of cleaning! To accomplish all this work, you may use Excel, R, or Python, but are these really the best tools for data cleaning tasks?&lt;/p&gt;

&lt;p&gt;In this blog post, I explore some classic &lt;strong&gt;data cleaning&lt;/strong&gt; scenarios and show how you can perform them &lt;em&gt;directly within your database&lt;/em&gt; using &lt;a href="https://www.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-website" rel="noopener noreferrer"&gt;TimescaleDB&lt;/a&gt; and &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt;, replacing the tasks that you may have done in Excel, R, or Python. TimescaleDB and PostgreSQL cannot replace these tools entirely, but they can help your data munging/cleaning tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.  &lt;/p&gt;

&lt;p&gt;Cleaning is a very important part of the analysis process and, in my experience, generally the most grueling! By cleaning data directly within my database, I can perform many of my cleaning tasks once rather than repeatedly within a script, saving me considerable time in the long run.&lt;/p&gt;

&lt;h1&gt;
  
  
  A recap of the data analysis process
&lt;/h1&gt;

&lt;p&gt;I began this series of posts on &lt;a href="https://blog.timescale.com/blog/speeding-up-data-analysis/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=first-post" rel="noopener noreferrer"&gt;data analysis&lt;/a&gt; by presenting the following summary of the analysis process:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8v6xr0dvn0i2brdxfzv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8v6xr0dvn0i2brdxfzv.jpeg" alt="Data Analysis Lifecycle" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first three steps of the analysis lifecycle (evaluate, clean, transform) comprise the “data munging” stages of analysis. Historically, I have done my data munging and modeling all within Python or R, both excellent options for analysis. However, once I was introduced to PostgreSQL and TimescaleDB, I discovered how efficient and fast it could be to do my data munging directly within my database. In my previous post, I focused on showing &lt;a href="https://blog.timescale.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=previous-post" rel="noopener noreferrer"&gt;data evaluation&lt;/a&gt; techniques and how you can replace tasks previously done in Python with PostgreSQL and TimescaleDB code. I now want to move on to the second step, &lt;strong&gt;data cleaning&lt;/strong&gt;. Cleaning may not be the most glamorous step in the analysis process, but it is absolutely crucial to creating accurate and meaningful models.&lt;/p&gt;

&lt;p&gt;As I mentioned &lt;a href="https://blog.timescale.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=previous-post" rel="noopener noreferrer"&gt;in my last post&lt;/a&gt;, my first job out of college was at an energy and sustainability solutions company that focused on monitoring all different kinds of utility usage - such as electricity, water, sewage, you name it - to figure out how our clients’ buildings could be more efficient. My role at this company was to perform data analysis and business intelligence tasks.&lt;/p&gt;

&lt;p&gt;Throughout my time in this job, I got the chance to use many popular data analysis tools including Excel, R, and Python. But once I tried using a database to perform my data munging tasks - specifically PostgreSQL and TimescaleDB - I realized how efficient and straightforward analysis, and particularly cleaning tasks, could be when done directly in a database. &lt;/p&gt;

&lt;p&gt;Before using a database for data cleaning tasks, I would often find either columns or values that needed to be edited. I would pull the raw data from a CSV file or database, then make any adjustments to this data within my Python script. This meant that every time I ran my Python script, my machine spent computational time setting up and cleaning my data, so I lost time with every run. Additionally, if I wanted to share cleaned data with colleagues, I would have to run the script for them or pass it along for them to run. This extra computational time could add up depending on the project. &lt;/p&gt;

&lt;p&gt;Instead, with PostgreSQL, I can write a query to do this cleaning once and then store the results in a table. I wouldn’t need to spend time cleaning and transforming data again and again with a Python script; I could just set up the cleaning process in my database and call it a day! Once I started to make cleaning changes directly within my database, I was able to skip performing cleaning tasks within Python and jump straight into modeling my data. &lt;/p&gt;

&lt;p&gt;To keep this post as succinct as possible, I chose to only show side-by-side code comparisons for Python and PostgreSQL. If you have any questions about other tools or languages, please feel free to join our &lt;a href="https://slack.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=slack" rel="noopener noreferrer"&gt;Slack channel&lt;/a&gt;, where you can ask the Timescale community, or me, specific questions about Timescale or PostgreSQL functionality 😊. I’d love to hear from you!&lt;/p&gt;

&lt;p&gt;Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away! Which is awesome! The easiest way to get started is by signing up for &lt;a href="https://www.timescale.com/timescale-signup/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-signup" rel="noopener noreferrer"&gt;a free 30-day trial&lt;/a&gt; of Timescale Cloud (if you prefer self-hosting, you can always &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-self-hosted" rel="noopener noreferrer"&gt;install and manage&lt;/a&gt; TimescaleDB on your own PostgreSQL instances). Learn more by &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-tutorials" rel="noopener noreferrer"&gt;following one of our many tutorials&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, before we dive into things and get our data, as Outkast best put it, “So fresh, So clean”, I want to quickly cover the data set I will be using. I also want to note that all the code I show assumes you have some basic knowledge of SQL. If you are not familiar with SQL, don’t worry! In my last post, I included a section on SQL basics, which you can find &lt;a href="https://blog.timescale.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=previous-post-sql-basics/#sql-basics/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  About the sample dataset
&lt;/h1&gt;

&lt;p&gt;In my experience within the data science realm, I have done the majority of my data cleaning after evaluation. However, sometimes it can be beneficial to clean data, evaluate, and then clean again. The process you choose depends on the initial state of your data and how easy it is to evaluate. For the data set I will use today, I would likely do some initial cleaning before evaluation and then clean again after, and I will show you why. &lt;/p&gt;

&lt;p&gt;I got the following &lt;a href="https://www.kaggle.com/jaganadhg/house-hold-energy-data" rel="noopener noreferrer"&gt;IoT data set from Kaggle&lt;/a&gt;, where a very generous individual shared the energy consumption readings from their apartment in San Jose, CA, recorded in 15-minute increments. While this is awesome data, it is structured a little differently than I would like. The raw data set follows this schema:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yw6mj7lg01oykgddtl7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yw6mj7lg01oykgddtl7.jpg" alt="energy_usage_staging table" width="800" height="636"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and appears like this…&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;start_time&lt;/th&gt;
&lt;th&gt;end_time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:00:00&lt;/td&gt;
&lt;td&gt;00:14:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:15:00&lt;/td&gt;
&lt;td&gt;00:29:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:30:00&lt;/td&gt;
&lt;td&gt;00:44:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:45:00&lt;/td&gt;
&lt;td&gt;00:59:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;01:00:00&lt;/td&gt;
&lt;td&gt;01:14:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;01:15:00&lt;/td&gt;
&lt;td&gt;01:29:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;01:30:00&lt;/td&gt;
&lt;td&gt;01:44:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;01:45:00&lt;/td&gt;
&lt;td&gt;01:59:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In order to do any type of analysis on this data set, I want to clean it up. A few things that quickly come to mind include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The cost column is stored as a text data type, which will cause some issues.&lt;/li&gt;
&lt;li&gt;The time columns are split apart, which could cause problems if I want to create plots over time or perform any type of modeling based on time.&lt;/li&gt;
&lt;li&gt;I may also want to filter the data based on various parameters that have to do with time, such as day of the week or holiday identification (both potentially play into how energy is used within the household). &lt;/li&gt;
&lt;/ul&gt;
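&lt;p&gt;To make the first two issues concrete, here is a minimal pandas sketch (the dataframe below is illustrative, not the full Kaggle file) showing why a text-typed cost column gets in the way of even basic numeric work:&lt;/p&gt;

```python
import pandas as pd

# Illustrative rows mimicking the raw schema, not the actual Kaggle file
raw = pd.DataFrame({
    "type": ["Electric usage", "Electric usage"],
    "date": ["2016-10-22", "2016-10-22"],
    "start_time": ["00:00:00", "00:15:00"],
    "usage": [0.01, 0.01],
    "units": ["kWh", "kWh"],
    "cost": ["$0.00", "$0.00"],
})

# cost is a plain object (text) column, so numeric work fails outright
print(raw["cost"].dtype)  # object
try:
    raw["cost"].sum() / 2  # string concatenation, then TypeError on the divide
except TypeError as err:
    print("cannot average text costs:", err)
```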

&lt;p&gt;To fix all of these issues and make data evaluation and analysis more valuable, I will have to clean the incoming data! So without further ado, let’s roll up our sleeves and dig in!&lt;/p&gt;

&lt;h1&gt;
  
  
  Cleaning the data
&lt;/h1&gt;

&lt;p&gt;I will show most of the techniques I have used in the past while working in data science. While these examples are not exhaustive, I hope they will cover many of the cleaning steps you perform during your own analysis, helping to make your cleaning tasks more efficient by using PostgreSQL and TimescaleDB.&lt;/p&gt;

&lt;p&gt;Please feel free to explore these various techniques and skip around if you need! There is a lot here, and I designed it to be a helpful glossary of tools that you could use as you need.&lt;/p&gt;

&lt;p&gt;The techniques that I will cover include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Correcting structural issues&lt;/li&gt;
&lt;li&gt;Creating or generating relevant data&lt;/li&gt;
&lt;li&gt;Adding data to a hypertable&lt;/li&gt;
&lt;li&gt;Renaming columns or tables&lt;/li&gt;
&lt;li&gt;Filling in missing values&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A note on my cleaning approach
&lt;/h3&gt;

&lt;p&gt;There are many ways that I could approach the cleaning process in PostgreSQL. I could create a table and then &lt;a href="https://www.postgresql.org/docs/current/sql-altertable.html" rel="noopener noreferrer"&gt;&lt;code&gt;ALTER&lt;/code&gt;&lt;/a&gt; it as I clean, create multiple tables as I add or change data, or work with &lt;a href="https://www.postgresql.org/docs/14/sql-createview.html" rel="noopener noreferrer"&gt;&lt;code&gt;VIEW&lt;/code&gt;s&lt;/a&gt;. Depending on the size of my data, any of these approaches &lt;em&gt;could&lt;/em&gt; make sense; however, they have different computational consequences.&lt;/p&gt;

&lt;p&gt;You may have noticed above that my raw data table was called &lt;code&gt;energy_usage_staging&lt;/code&gt;. This is because I decided that given the state of my raw data, it would be best for me to place the raw data in a &lt;em&gt;staging table&lt;/em&gt;, clean it using &lt;code&gt;VIEW&lt;/code&gt;s, then insert it into a more usable table as part of my cleaning process. This move from raw table to the usable table could happen even before the evaluation step of analysis. As I discussed above, sometimes data cleaning has to occur after AND before evaluating your data. Regardless, this data needs to be cleaned and I wanted to use the most efficient method possible. In this case, that meant using a staging table and leveraging the efficiency and power of PostgreSQL &lt;code&gt;VIEW&lt;/code&gt;s, something I will talk about later.&lt;/p&gt;

&lt;p&gt;Generally, if you are dealing with a lot of data, altering an existing table in PostgreSQL can be costly. For this post, I will show you how to build up clean data using &lt;code&gt;VIEW&lt;/code&gt;s along with additional tables. This method of cleaning is more efficient and sets you up for the next blog post about data transformation which includes the use of scripts in PostgreSQL.&lt;/p&gt;


&lt;h2&gt;
  
  
  Correcting structural issues
&lt;/h2&gt;

&lt;p&gt;Right off the bat, I know that I need to do some data refactoring on my raw table due to data types. Notice that the &lt;code&gt;date&lt;/code&gt; and time columns are separated, and &lt;code&gt;cost&lt;/code&gt; is recorded as a text data type. I need to convert the separated date and time columns to a single timestamp and the &lt;code&gt;cost&lt;/code&gt; column to float4. But before I show that, I want to talk about why conversion to timestamp is beneficial.&lt;/p&gt;

&lt;h3&gt;
  
  
  TimescaleDB hypertables and why timestamp is important
&lt;/h3&gt;

&lt;p&gt;For those of you not familiar with the structure of &lt;a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=hypertables-chunks" rel="noopener noreferrer"&gt;TimescaleDB hypertables&lt;/a&gt;, they are the basis of how TimescaleDB efficiently queries and manipulates time-series data. Hypertables are partitioned based on time - specifically, by the time column you specify when you create the table.&lt;/p&gt;

&lt;p&gt;The data is partitioned by timestamp into "chunks" so that every row in the table belongs to some &lt;em&gt;chunk&lt;/em&gt; based on a time range. Queries that filter on time can then touch only the relevant chunks, which makes time-based querying and data manipulation more efficient. This image represents the difference between a normal table and our special hypertables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvw87tb0201nsg318udh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvw87tb0201nsg318udh.jpg" alt="Hypertables example" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;
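&lt;p&gt;To illustrate the idea of time partitioning (this is just a toy sketch, not how TimescaleDB actually implements chunks), you can think of every timestamp as mapping to the fixed time range, or chunk, that contains it:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Toy illustration of time partitioning: each row lands in a chunk
# identified by the start of its 7-day time range.
CHUNK_INTERVAL = timedelta(days=7)
EPOCH = datetime(2016, 1, 1)

def chunk_for(ts):
    """Return the start of the 7-day chunk that a timestamp belongs to."""
    offset = ts - EPOCH
    chunk_index = offset // CHUNK_INTERVAL  # whole chunks since the epoch
    return EPOCH + chunk_index * CHUNK_INTERVAL

rows = [datetime(2016, 10, 22, 0, 0), datetime(2016, 10, 22, 0, 15),
        datetime(2016, 11, 5, 12, 0)]
for ts in rows:
    print(ts, "lands in chunk starting", chunk_for(ts))
```

A query constrained to one day in October would then only need to look inside the single chunk covering that range, rather than scanning the whole table.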

&lt;h3&gt;
  
  
  Changing date-time structure
&lt;/h3&gt;

&lt;p&gt;Because I want to utilize TimescaleDB functionality to the fullest, such as continuous aggregates and faster time-based queries, I want to restructure the &lt;code&gt;energy_usage_staging&lt;/code&gt; table's &lt;code&gt;date&lt;/code&gt; and time columns. I could use the &lt;code&gt;date&lt;/code&gt; column for my hypertable partitioning; however, I would have limited control over manipulating my data based on time. It is more flexible and space-efficient to have a single column with a timestamp than it is to have separate columns with date and time. I can always extract the date or time from the timestamp if I want to later!  &lt;/p&gt;

&lt;p&gt;Looking back at the table structure, I should be able to get a usable timestamp value from the &lt;code&gt;date&lt;/code&gt; and &lt;code&gt;start_time&lt;/code&gt; columns, since the &lt;code&gt;end_time&lt;/code&gt; really doesn’t give me much useful information. Thus, I want to combine these two columns to form a new timestamp column. Let’s see how I can do that using SQL. Spoiler alert: it is as simple as an algebraic statement. How cool is that?!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
In PostgreSQL, I can compute the new column without inserting it into the database just yet. Since I want to create a NEW table from this staging one, I don’t want to add more columns or tables prematurely.&lt;/p&gt;

&lt;p&gt;Let’s first compare the original columns with our new generated column. For this query I simply &lt;em&gt;add&lt;/em&gt; the two columns together. The &lt;code&gt;AS&lt;/code&gt; keyword just allows me to rename the column to whatever I would like, in this case &lt;code&gt;time&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--add the date column to the start_time column&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage_staging&lt;/span&gt; &lt;span class="n"&gt;eus&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;start_time&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:00:00&lt;/td&gt;
&lt;td&gt;2016-10-22 00:00:00.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:15:00&lt;/td&gt;
&lt;td&gt;2016-10-22 00:15:00.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:30:00&lt;/td&gt;
&lt;td&gt;2016-10-22 00:30:00.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:45:00&lt;/td&gt;
&lt;td&gt;2016-10-22 00:45:00.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;01:00:00&lt;/td&gt;
&lt;td&gt;2016-10-22 01:00:00.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;01:15:00&lt;/td&gt;
&lt;td&gt;2016-10-22 01:15:00.000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
In Python, the easiest way to do this is to add a new column to the dataframe. Notice that in Python I have to concatenate the two text columns with a space between them, then convert the result to datetime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
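&lt;p&gt;The snippet above assumes &lt;code&gt;energy_stage_df&lt;/code&gt; is already loaded from the raw data. Here is a self-contained sketch with a couple of illustrative rows, showing the concatenate-then-parse step end to end:&lt;/p&gt;

```python
import pandas as pd

# Illustrative stand-in for energy_stage_df loaded from the raw CSV
energy_stage_df = pd.DataFrame({
    "date": ["2016-10-22", "2016-10-22"],
    "start_time": ["00:00:00", "00:15:00"],
})

# Concatenate the text columns with a space, then parse into a datetime64 column
energy_stage_df["time"] = pd.to_datetime(
    energy_stage_df["date"] + " " + energy_stage_df["start_time"]
)
print(energy_stage_df[["date", "start_time", "time"]])
```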



&lt;h3&gt;
  
  
  Changing column data types
&lt;/h3&gt;

&lt;p&gt;Next, I want to change the data type of my cost column from text to float. Again, this is straightforward in PostgreSQL with the &lt;a href="https://www.postgresql.org/docs/14/functions-formatting.html" rel="noopener noreferrer"&gt;&lt;code&gt;TO_NUMBER()&lt;/code&gt;&lt;/a&gt; function. &lt;/p&gt;

&lt;p&gt;The format of the function is as follows: &lt;code&gt;TO_NUMBER(‘text’, ‘format’)&lt;/code&gt;. The ‘format’ input is a PostgreSQL-specific string that you can build depending on what type of text you want to convert. In our case, we have a &lt;code&gt;$&lt;/code&gt; symbol followed by a numeric value such as &lt;code&gt;0.00&lt;/code&gt;. For the format string I used ‘L9G999D99’: the L tells PostgreSQL there is a currency symbol at the beginning of the text, the 9s mark numeric digits, the G marks a group (thousands) separator, and the D stands for the decimal point.&lt;/p&gt;

&lt;p&gt;This format comfortably covers the cost column, which has no values greater than $0.65. If you were planning to convert a column with larger numeric values, you would add more digits and G separators to account for them. For example, if you had a cost column with text values like ‘$1,672,278.23’, you would format the string like this: ‘L9G999G999D99’.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--create a new column called cost_new with the to_number() function&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TO_NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'L9G999D99'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cost_new&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage_staging&lt;/span&gt; &lt;span class="n"&gt;eus&lt;/span&gt;  
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;cost_new&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;cost_new&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;$0.65&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.65&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.65&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.57&lt;/td&gt;
&lt;td&gt;0.57&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.46&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.46&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.46&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.46&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
For Python, I used a lambda function that replaces each ‘$’ sign with an empty string. Because &lt;code&gt;apply()&lt;/code&gt; processes the column row by row, this can be fairly inefficient on large data sets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost_new&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost_new&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
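&lt;p&gt;Again, the snippet assumes &lt;code&gt;energy_stage_df&lt;/code&gt; already exists. Here is a self-contained sketch with illustrative values, including a vectorized &lt;code&gt;str.replace&lt;/code&gt; alternative that is usually preferred over a row-by-row lambda in pandas:&lt;/p&gt;

```python
import pandas as pd

# Illustrative stand-in for the staging dataframe
energy_stage_df = pd.DataFrame({"cost": ["$0.00", "$0.65", "$0.46"]})

# Row-by-row lambda, as in the snippet above
energy_stage_df["cost_new"] = pd.to_numeric(
    energy_stage_df.cost.apply(lambda x: x.replace("$", ""))
)

# Vectorized string method: same result, more idiomatic pandas
energy_stage_df["cost_vec"] = pd.to_numeric(
    energy_stage_df["cost"].str.replace("$", "", regex=False)
)
print(energy_stage_df)
```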



&lt;h3&gt;
  
  
  Creating a &lt;code&gt;VIEW&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Now that I know how to convert my columns, I can combine the two queries and create a &lt;code&gt;VIEW&lt;/code&gt; of my new restructured table. A &lt;a href="https://www.postgresql.org/docs/14/sql-createview.html" rel="noopener noreferrer"&gt;&lt;code&gt;VIEW&lt;/code&gt;&lt;/a&gt; is a PostgreSQL object that allows you to define a query and call it by its name, as if it were a table within your database. I can use the following query to generate the data I want, and then create a &lt;code&gt;VIEW&lt;/code&gt; from it that I can query as if it were a table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- query the right data that I want&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="nv"&gt;"usage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;TO_NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'L9G999D99'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;notes&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage_staging&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:30:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:45:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 02:00:00.000&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 02:15:00.000&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I decided to call my &lt;code&gt;VIEW&lt;/code&gt; &lt;code&gt;energy_view&lt;/code&gt;. Now, when I want to do further cleaning, I can just specify its name in the &lt;code&gt;FROM&lt;/code&gt; clause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--create view from the query above&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="nv"&gt;"usage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;TO_NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'L9G999D99'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;notes&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage_staging&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;units&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost_new&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost_new&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is important to note that with PostgreSQL &lt;code&gt;VIEW&lt;/code&gt;s, the underlying query has to be recomputed every time you query the &lt;code&gt;VIEW&lt;/code&gt;. This is why we want to insert our &lt;code&gt;VIEW&lt;/code&gt; data into a hypertable once we have the data set up just right. You can think of &lt;code&gt;VIEW&lt;/code&gt;s as a shorthand version of the &lt;a href="https://blog.timescale.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-tutorials/#cte/" rel="noopener noreferrer"&gt;CTE &lt;code&gt;WITH ... AS&lt;/code&gt;&lt;/a&gt; statement I discussed in my last post.&lt;/p&gt;
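&lt;p&gt;To make that recomputation trade-off concrete, here is a minimal sketch using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module (not PostgreSQL or TimescaleDB; the table and column names are illustrative): a &lt;code&gt;VIEW&lt;/code&gt; re-runs its defining query on every read, while a table created from the &lt;code&gt;VIEW&lt;/code&gt; stores the results once.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE energy_usage_staging (usage REAL, cost TEXT)")
conn.execute("INSERT INTO energy_usage_staging VALUES (0.01, '$0.00'), (0.02, '$0.05')")

# A view is just a stored query: it is re-evaluated on every SELECT
conn.execute("""
    CREATE VIEW energy_view AS
    SELECT usage, CAST(REPLACE(cost, '$', '') AS REAL) AS cost
    FROM energy_usage_staging
""")

# Materializing the view into a plain table computes the results once,
# roughly what inserting the VIEW data into a hypertable achieves
conn.execute("CREATE TABLE energy_data AS SELECT * FROM energy_view")

print(conn.execute("SELECT * FROM energy_data").fetchall())
# → [(0.01, 0.0), (0.02, 0.05)]
```

In PostgreSQL, the analogous materialization would be a &lt;code&gt;CREATE TABLE ... AS SELECT * FROM energy_view&lt;/code&gt; (or a &lt;code&gt;MATERIALIZED VIEW&lt;/code&gt;).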

&lt;p&gt;We are now one step closer to cleaner data!&lt;/p&gt;


&lt;h2&gt;
  
  
  Creating or generating relevant data
&lt;/h2&gt;

&lt;p&gt;With some quick investigation, we can see that the notes column is blank for this data set. To check this, I just need to include a &lt;code&gt;WHERE&lt;/code&gt; clause that filters for rows where &lt;code&gt;notes&lt;/code&gt; is not equal to an empty string. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="n"&gt;ew&lt;/span&gt;
&lt;span class="c1"&gt;-- where notes are not equal to an empty string&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The results come back empty, confirming that the notes column contains no data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;notnull&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the notes are blank, I would like to replace the column with additional information that I could use later on during modeling. In particular, I would like to add a column that specifies the day of the week. To do this, I can use the &lt;code&gt;EXTRACT()&lt;/code&gt; command. The &lt;a href="https://www.postgresql.org/docs/14/functions-datetime.html" rel="noopener noreferrer"&gt;&lt;code&gt;EXTRACT()&lt;/code&gt;&lt;/a&gt; command is a PostgreSQL date/time function that pulls individual date/time elements out of a timestamp. For our column, PostgreSQL has the field DOW (day-of-week), which maps Sunday (0) through Saturday (6).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--extract day-of-week from date column and cast the output to an int&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="n"&gt;ew&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
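&lt;p&gt;The two numbering conventions are easy to mix up: PostgreSQL’s DOW runs Sunday (0) through Saturday (6), while pandas’ &lt;code&gt;dt.dayofweek&lt;/code&gt; (like Python’s &lt;code&gt;datetime.weekday()&lt;/code&gt;) runs Monday (0) through Sunday (6). A quick sketch of the mapping, using the first date in the data set (2016-10-22, a Saturday):&lt;/p&gt;

```python
from datetime import datetime

ts = datetime(2016, 10, 22)  # a Saturday

pandas_style = ts.weekday()            # Monday=0 .. Sunday=6
postgres_dow = (ts.weekday() + 1) % 7  # Sunday=0 .. Saturday=6

print(pandas_style, postgres_dow)  # → 5 6
```

This is why the table above shows &lt;code&gt;day_of_week&lt;/code&gt; = 6 for 2016-10-22, while the pandas column would hold 5 for the same rows.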



&lt;p&gt;Additionally, we may want to add another column that specifies whether a day falls on a weekend or a weekday. I will do this by creating a boolean column, where &lt;code&gt;true&lt;/code&gt; represents a weekend, and &lt;code&gt;false&lt;/code&gt; represents a weekday. To do this, I will apply a &lt;a href="https://www.postgresql.org/docs/14/plpgsql-control-structures.html" rel="noopener noreferrer"&gt;&lt;code&gt;CASE&lt;/code&gt;&lt;/a&gt; statement. With this command I can write “when-then” logic (similar to “if-then” statements in most programming languages): &lt;code&gt;WHEN&lt;/code&gt; a &lt;code&gt;day_of_week&lt;/code&gt; value is &lt;code&gt;IN&lt;/code&gt; the set (0,6) &lt;code&gt;THEN&lt;/code&gt; the output should be &lt;code&gt;true&lt;/code&gt;, &lt;code&gt;ELSE&lt;/code&gt; the value should be &lt;code&gt;false&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="c1"&gt;--use the case statement to make a column true when records fall on a weekend aka 0 and 6&lt;/span&gt;
&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="n"&gt;ew&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;is_weekend&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Fun fact: you can get the same result without a &lt;code&gt;CASE&lt;/code&gt; statement; however, this shortcut only works for boolean columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--another method to create a binary column&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="n"&gt;ew&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
Notice that in Python, weekend days are represented by the numbers 5 and 6 (Saturday and Sunday), versus the PostgreSQL weekend values 0 and 6 (Sunday and Saturday).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_weekend&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
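&lt;p&gt;Since the membership test itself already yields a boolean, the &lt;code&gt;np.where()&lt;/code&gt; step (which produces 1/0 integers) is optional; &lt;code&gt;energy_df['day_of_week'].isin([5, 6])&lt;/code&gt; alone would give a boolean column closer to the PostgreSQL output. The same test sketched in plain Python, using the pandas day numbering:&lt;/p&gt;

```python
# Pandas numbering: Monday=0 .. Sunday=6, so the weekend is {5, 6}
WEEKEND_DAYS = {5, 6}

def is_weekend(day_of_week: int) -> bool:
    """True for Saturday (5) or Sunday (6)."""
    return day_of_week in WEEKEND_DAYS

print([is_weekend(d) for d in (4, 5, 6, 0)])
# → [False, True, True, False]
```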



&lt;p&gt;And maybe then things start getting really wild: maybe you want to add even more parameters! &lt;/p&gt;

&lt;p&gt;Let’s consider holidays. Now you may be asking “Why in the world would we do that?!”, but people in the US often have time off around holidays. Since this individual lives within the US, they likely have at least &lt;em&gt;some&lt;/em&gt; holidays off, whether on the day itself or on the observed federal holiday. Where there are days off, there could be a difference in energy usage. To help guide my analysis, I want to include the identification of holidays. To do this, I’m going to create another boolean column that identifies when a federal holiday occurs. &lt;/p&gt;

&lt;p&gt;To do this, I am going to use TimescaleDB’s &lt;code&gt;time_bucket()&lt;/code&gt; function. The &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=time-bucket-docs" rel="noopener noreferrer"&gt;&lt;code&gt;time_bucket()&lt;/code&gt;&lt;/a&gt; function is one of the functions I discussed in detail within my &lt;a href="https://blog.timescale.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/?utm_source=tds&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=last-blog-time-bucket#timebucket/" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;. Essentially, I need this function to make sure every time value within a single day gets accounted for. Without &lt;code&gt;time_bucket()&lt;/code&gt;, only the row whose timestamp falls exactly at midnight would match a holiday date. &lt;/p&gt;
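&lt;p&gt;The role &lt;code&gt;time_bucket('1 day', time)&lt;/code&gt; plays here can be sketched in plain Python, with &lt;code&gt;datetime.date()&lt;/code&gt; standing in for the one-day bucket (the sample readings and holiday set below are illustrative only): truncate each timestamp to its day, then test membership against the holiday set.&lt;/p&gt;

```python
from datetime import datetime, date

# Illustrative holiday set and readings
holidays = {date(2016, 11, 24), date(2016, 12, 25)}

timestamps = [
    datetime(2016, 11, 24, 0, 15),   # Thanksgiving 2016, 00:15
    datetime(2016, 11, 24, 13, 45),  # Thanksgiving 2016, 13:45
    datetime(2016, 11, 25, 9, 0),    # the day after
]

# Truncating to the day means every reading within a holiday matches,
# not just the row at exactly midnight
is_holiday = [ts.date() in holidays for ts in timestamps]
print(is_holiday)  # → [True, True, False]
```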

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
After I create a holiday table, I can use its data directly within my query. I also decided to use the non-&lt;code&gt;CASE&lt;/code&gt; syntax for this query; note that you can use either!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--create table for the holidays&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;holidays&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;--insert the holidays into table&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;holidays&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2016-11-11'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2016-11-24'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2016-12-24'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2016-12-25'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2016-12-26'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-01-01'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-01-02'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-01-16'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-02-20'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-05-29'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-07-04'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-09-04'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-10-9'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-11-10'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-11-23'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-11-24'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-12-24'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-12-25'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-01-01'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-01-15'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-02-19'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-05-28'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-07-4'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-09-03'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-10-8'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="c1"&gt;-- I can then select the data from the holidays table directly within my IN statement&lt;/span&gt;
&lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;holidays&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_holiday&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="n"&gt;ew&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;is_weekend&lt;/th&gt;
&lt;th&gt;is_holiday&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;holidays&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2016-11-11&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2016-11-24&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2016-12-24&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2016-12-25&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2016-12-26&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-01-01&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-01-02&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-01-16&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-02-20&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-05-29&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-07-04&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-09-04&lt;/span&gt;&lt;span 
class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-10-9&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-11-10&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-11-23&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-11-24&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-12-24&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-12-25&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-01-01&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-01-15&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-02-19&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-05-28&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-07-4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-09-03&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span 
class="s"&gt;2018-10-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_holiday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;holidays&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, I’m going to save this expanded table into another &lt;code&gt;VIEW&lt;/code&gt; so that I can call the data without writing out the query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--create another view with the data from our first round of cleaning&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;energy_view_exp&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;holidays&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_holiday&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="n"&gt;ew&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You may be asking, “Why did you create these as boolean columns?” A fair question! I often want to use these columns for filtering during analysis, and in PostgreSQL, boolean columns make that filtering trivial. For example, say I want to take my table query so far and show only the data that falls on a weekend &lt;code&gt;AND&lt;/code&gt; on a holiday. I can do this simply by adding a &lt;code&gt;WHERE&lt;/code&gt; clause that names the two columns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--if you use binary columns, then you can filter with a simple WHERE statement&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view_exp&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;is_holiday&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;is_weekend&lt;/th&gt;
&lt;th&gt;is_holiday&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-12-24 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-12-24 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-12-24 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-12-24 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-12-24 01:00:00.000&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-12-24 01:15:00.000&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_weekend&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_holiday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Adding data to a hypertable
&lt;/h2&gt;

&lt;p&gt;Now that my new columns are ready and I know how I want the table structured, I can create a new hypertable and insert the cleaned data. In my own analysis of this data set, I might do all the cleaning up to this point &lt;em&gt;before&lt;/em&gt; evaluating the data, so that the evaluation step is more meaningful. Happily, you can use any of these techniques for general cleaning, whether before or after evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;usage&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;units&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;day_of_week&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;is_weekend&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;is_holiday&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;--command to create a hypertable&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;create_hypertable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'energy_usage'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'time'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view_exp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
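&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt; there is no hypertable equivalent on the pandas side, but the insert step itself can be sketched with &lt;code&gt;to_sql&lt;/code&gt;. This is a minimal sketch using an in-memory SQLite connection and a hypothetical two-row stand-in for the cleaned data; in practice you would hand &lt;code&gt;to_sql&lt;/code&gt; a connection to your PostgreSQL/TimescaleDB database instead, and the hypertable would still be created on the database side with &lt;code&gt;create_hypertable&lt;/code&gt;.&lt;/p&gt;

```python
import sqlite3

import pandas as pd

# Hypothetical two-row stand-in for the cleaned energy data
energy_df = pd.DataFrame({
    "type": ["Electric usage", "Electric usage"],
    "time": pd.to_datetime(["2016-10-22 00:00", "2016-10-22 00:15"]),
    "usage": [0.01, 0.01],
    "units": ["kWh", "kWh"],
    "cost": [0.00, 0.00],
    "day_of_week": [6, 6],
    "is_weekend": [True, True],
    "is_holiday": [False, False],
})

# An in-memory SQLite database stands in for PostgreSQL/TimescaleDB here
conn = sqlite3.connect(":memory:")
energy_df.to_sql("energy_usage", conn, if_exists="append", index=False)
```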




&lt;p&gt;Note that if you had data continually coming in, you could create a script within your database that applies these transformations automatically on import. That way, cleaned data is ready and waiting in your database, rather than being reprocessed in your scripts every time you want to perform analysis. &lt;/p&gt;

&lt;p&gt;We will discuss this in detail in my next post, so make sure to stay tuned if you want to know how to create scripts and keep data automatically updated!&lt;/p&gt;

&lt;h2&gt;
  
  
  Renaming values
&lt;/h2&gt;

&lt;p&gt;Another valuable data cleaning technique is renaming items or remapping categorical values. The importance of this skill is underscored by the &lt;a href="https://stackoverflow.com/questions/40427943/how-do-i-change-a-single-index-value-in-pandas-dataframe" rel="noopener noreferrer"&gt;popularity of this Python data analysis question on StackOverflow&lt;/a&gt;, which asks, “How do I change a single index value in a pandas dataframe?” Since PostgreSQL and TimescaleDB use relational table structures, renaming unique values is fairly simple. &lt;/p&gt;

&lt;p&gt;When renaming specific index values within a table, you can do this “on the fly” by using PostgreSQL’s &lt;code&gt;CASE&lt;/code&gt; statement within the &lt;code&gt;SELECT&lt;/code&gt; query. Let’s say I don’t like Sunday being represented by a 0 in the &lt;code&gt;day_of_week&lt;/code&gt; column, but would prefer it to be a 7. I can do this with the following query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="c1"&gt;-- you can use case to recode column values &lt;/span&gt;
&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; 
&lt;span class="k"&gt;END&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
A caveat: this code would make Monday = 7, because pandas numbers the days with Monday set to 0 and Sunday set to 6, while PostgreSQL’s &lt;code&gt;EXTRACT(DOW ...)&lt;/code&gt; starts the week at Sunday = 0. Still, this is how you would update a single value within a column. You likely would not want to perform this exact change; I just wanted to show the Python equivalent for reference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
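&lt;p&gt;If you do want the two numbering schemes to line up, one sketch (using arbitrary example dates) is to shift pandas’ Monday = 0 convention over to PostgreSQL’s Sunday = 0 convention:&lt;/p&gt;

```python
import pandas as pd

# Two sample timestamps: a Sunday and a Monday
times = pd.Series(pd.to_datetime(["2018-07-22", "2018-07-23"]))

# pandas convention: Monday=0 ... Sunday=6
pandas_dow = times.dt.dayofweek

# Shift to PostgreSQL's EXTRACT(DOW ...) convention: Sunday=0 ... Saturday=6
postgres_dow = (pandas_dow + 1) % 7
```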



&lt;p&gt;Now, what if I want to use the names of the days of the week instead of numeric values? For this example, I want to ditch the &lt;code&gt;CASE&lt;/code&gt; statement and create a mapping table. When you need to change many values, it will likely be more efficient to create a mapping table and join to it with the &lt;a href="https://www.postgresql.org/docs/14/queries-table-expressions.html" rel="noopener noreferrer"&gt;&lt;code&gt;JOIN&lt;/code&gt;&lt;/a&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--first I need to create the table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;day_of_week_mapping&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;day_of_week_int&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;day_of_week_name&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;--then I want to add data to my table&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;day_of_week_mapping&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Sunday'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Monday'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Tuesday'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Wednesday'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Thursday'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Friday'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Saturday'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;--then I can join this table to my cleaning table to remap the days of the week&lt;/span&gt;
&lt;span class="k"&gt;SElECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dowm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day_of_week_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt; &lt;span class="n"&gt;eu&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;day_of_week_mapping&lt;/span&gt; &lt;span class="n"&gt;dowm&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;dowm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day_of_week_int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eu&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day_of_week&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;day_of_week_name&lt;/th&gt;
&lt;th&gt;is_weekend&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2018-07-22 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;Sunday&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2018-07-22 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;Sunday&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2018-07-22 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;Sunday&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2018-07-22 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;Sunday&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2018-02-11 23:00:00.000&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;Sunday&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python:&lt;/strong&gt;&lt;br&gt;
In this case, pandas has a similar mapping function, &lt;code&gt;map&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sunday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Monday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tuesday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Wednesday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Thursday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Friday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span 
class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Saturday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hopefully, one of these techniques will be useful for you as you approach data renaming!&lt;/p&gt;

&lt;p&gt;Additionally, remember that if you would like to change the name of a column in your table, it is truly as easy as &lt;code&gt;AS&lt;/code&gt; (I couldn’t not use such a ridiculous statement 😂). When you use the &lt;code&gt;SELECT&lt;/code&gt; statement, you can rename your columns like so:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;usage_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;time_stamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dollar_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view_exp&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;usage_type&lt;/th&gt;
&lt;th&gt;time_stamp&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;dollar_amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
Comparatively, renaming columns in Python takes more ceremony: you build a rename dictionary and then reselect the columns you want. This is an area where SQL is not only faster but also more elegant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;usage_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time_stamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dollar_amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;usage_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time_stamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;units&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dollar_amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Fill in missing data
&lt;/h2&gt;

&lt;p&gt;Another common problem in the data cleaning process is missing data. The dataset we are using has no obviously missing data points; however, upon closer evaluation, we could well find missing hourly data caused by a power outage or some other phenomenon. This is where TimescaleDB’s gap-filling functions come in handy. Missing data can significantly degrade the accuracy and dependability of a model, but you can often navigate this problem by filling the gaps with reasonable estimates, and TimescaleDB has built-in functions to help you do exactly that. &lt;/p&gt;

&lt;p&gt;For example, let’s say that you are modeling the energy usage over individual days of the week and a handful of days have missing energy data due to a power outage or an issue with the sensor. We could remove the data, or try to fill in the missing values with reasonable estimations. For today, let’s assume that the model I want to use would benefit more from filling in the missing values. &lt;/p&gt;
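&lt;p&gt;To build some intuition for what linear interpolation will do before we reach for the database, here is a minimal, hypothetical Python sketch (standard library only, not TimescaleDB’s implementation) that estimates readings inside a gap from the known points on either side:&lt;/p&gt;

```python
# Minimal sketch of linear interpolation across a gap, standard library only.
# The readings and timestamps are hypothetical, mirroring a 7:45-11:30 outage.
from datetime import datetime, timedelta

def interpolate_gap(t0, v0, t1, v1, step):
    """Yield (time, value) pairs strictly inside the gap (t0, t1),
    spaced `step` apart, following a straight line from v0 to v1."""
    total = (t1 - t0).total_seconds()
    t = t0 + step
    while t < t1:
        frac = (t - t0).total_seconds() / total
        yield t, v0 + frac * (v1 - v0)
        t += step

# Known readings on either side of the missing stretch.
before = (datetime(2021, 1, 1, 7, 45), 0.2)
after = (datetime(2021, 1, 1, 11, 30), 0.04)

filled = list(interpolate_gap(*before, *after, timedelta(minutes=15)))
for t, v in filled:
    print(t.strftime("%H:%M"), round(v, 4))
```

&lt;p&gt;Each estimated point simply sits on the straight line between the last reading before the gap and the first reading after it, which is the idea behind TimescaleDB’s interpolation hyperfunction.&lt;/p&gt;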

&lt;p&gt;To illustrate, I created some sample data in a table called energy_data; it is missing both time and energy readings for the timestamps between 7:45 a.m. and 11:30 a.m.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;energy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 07:00:00.000&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 07:15:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 07:30:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 07:45:00.000&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 11:30:00.000&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 11:45:00.000&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 12:00:00.000&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 12:15:00.000&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 12:30:00.000&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 12:45:00.000&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 13:00:00.000&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I can use TimescaleDB’s &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=gapfilling-docs" rel="noopener noreferrer"&gt;gapfilling hyperfunctions&lt;/a&gt; to fill in these missing values. The &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/interpolate/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=interpolate-docs" rel="noopener noreferrer"&gt;&lt;code&gt;interpolate()&lt;/code&gt;&lt;/a&gt; function is another of TimescaleDB’s hyperfunctions; it generates data points along a linear approximation between the data points before and after the missing range. Alternatively, you could use the &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/locf/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=locf-docs" rel="noopener noreferrer"&gt;&lt;code&gt;locf()&lt;/code&gt;&lt;/a&gt; hyperfunction, which carries the last recorded value forward to fill in the gap (locf stands for “last observation carried forward”). Both of these functions must be used in conjunction with the &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=time-bucket-gapfilling-docs" rel="noopener noreferrer"&gt;&lt;code&gt;time_bucket_gapfill()&lt;/code&gt;&lt;/a&gt; function. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="c1"&gt;--here I specified that the data should increment by 15 mins&lt;/span&gt;
  &lt;span class="n"&gt;time_bucket_gapfill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'15 min'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;interpolate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
  &lt;span class="n"&gt;locf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_data&lt;/span&gt;
&lt;span class="c1"&gt;--to use gapfill, you will have to take out any time data associated with null values. You can do this using the IS NOT NULL statement&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;energy&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2021-01-01 07:00:00.000'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2021-01-01 13:00:00.000'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;timestamp&lt;/th&gt;
&lt;th&gt;interpolate&lt;/th&gt;
&lt;th&gt;locf&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 07:00:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0.10000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 07:30:00.000&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 08:00:00.000&lt;/td&gt;
&lt;td&gt;0.13625&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 08:30:00.000&lt;/td&gt;
&lt;td&gt;0.1225&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 09:00:00.000&lt;/td&gt;
&lt;td&gt;0.10875&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 09:30:00.000&lt;/td&gt;
&lt;td&gt;0.095&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 10:00:00.000&lt;/td&gt;
&lt;td&gt;0.08125&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 10:30:00.000&lt;/td&gt;
&lt;td&gt;0.0675&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 11:00:00.000&lt;/td&gt;
&lt;td&gt;0.05375&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 11:30:00.000&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;0.04000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 12:00:00.000&lt;/td&gt;
&lt;td&gt;0.025&lt;/td&gt;
&lt;td&gt;0.02500000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 12:30:00.000&lt;/td&gt;
&lt;td&gt;0.025&lt;/td&gt;
&lt;td&gt;0.02500000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;energy_test_df_locf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;energy_test_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;15 min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ffill&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;energy_test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;energy_test_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;15 min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;interpolate&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;energy_test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;locf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;energy_test_df_locf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;energy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
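&lt;p&gt;The snippet above assumes &lt;code&gt;energy_test_df&lt;/code&gt; has already been loaded from the database. If you want to try the same idea end to end, here is a self-contained sketch (with hypothetical values) that builds a tiny frame containing a gap and applies both linear interpolation and forward-fill, the pandas analogue of &lt;code&gt;locf()&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical 15-minute readings with a gap between 07:45 and 11:30.
df = pd.DataFrame({
    "time": pd.to_datetime(
        ["2021-01-01 07:30", "2021-01-01 07:45", "2021-01-01 11:30"]
    ),
    "energy": [0.1, 0.2, 0.04],
})

base = df.set_index("time")
interp = base.resample("15min").interpolate()  # straight line across the gap
locf = base.resample("15min").ffill()          # carry last value forward
print(interp.join(locf, rsuffix="_locf"))
```

&lt;p&gt;Resampling to a 15-minute grid inserts NaN rows for the missing timestamps, and the two fill strategies then estimate them the same way the TimescaleDB hyperfunctions do.&lt;/p&gt;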



&lt;p&gt;&lt;strong&gt;Bonus:&lt;/strong&gt;&lt;br&gt;
The following queries show how I could simply ignore the missing data instead. I wanted to include this to show just how easy it is to exclude null data. Alternatively, I could use a &lt;code&gt;WHERE&lt;/code&gt; clause on the time column to specify the ranges I would like to ignore (the second query).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_data&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;energy&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_data&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="s1"&gt;'2021-01-01 07:45:00.000'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2021-01-01 11:30:00.000'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Wrap Up
&lt;/h1&gt;

&lt;p&gt;After reading through these various cleaning techniques, I hope you feel more comfortable with exploring some of the possibilities that PostgreSQL and TimescaleDB provide. By cleaning data directly within my database, I am able to perform a lot of my cleaning tasks a single time rather than repetitively within a script, thus saving me time in the long run. If you are looking to save time and effort while cleaning your data for analysis, definitely consider using PostgreSQL and TimescaleDB. &lt;/p&gt;

&lt;p&gt;In my next posts, I will go over techniques on how to transform data using PostgreSQL and TimescaleDB. I'll then take everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).&lt;/p&gt;

&lt;p&gt;If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our &lt;a href="https://slack.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=slack" rel="noopener noreferrer"&gt;community Slack&lt;/a&gt;, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).&lt;/p&gt;

&lt;p&gt;If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can &lt;a href="https://www.timescale.com/timescale-signup/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-signup" rel="noopener noreferrer"&gt;sign up for a free 30-day trial&lt;/a&gt; or &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-self-hosted" rel="noopener noreferrer"&gt;install TimescaleDB and manage it on your current PostgreSQL instances&lt;/a&gt;. We also have a bunch of &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-tutorials" rel="noopener noreferrer"&gt;great tutorials&lt;/a&gt; to help get you started.&lt;/p&gt;

&lt;p&gt;Until next time!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functionality Glossary:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding columns together&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TO_NUMBER()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;VIEW&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WHERE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;EXTRACT()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CASE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time_bucket()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;JOIN&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AS&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE TABLE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;create_hypertable()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;INSERT INTO&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time_bucket_gapfill()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>database</category>
      <category>analytics</category>
      <category>postgres</category>
    </item>
    <item>
      <title>PostgreSQL vs Python for data evaluation: what, why, and how</title>
      <dc:creator>mirandaauhl</dc:creator>
      <pubDate>Thu, 07 Oct 2021 20:15:11 +0000</pubDate>
      <link>https://dev.to/tigerdata/postgresql-vs-python-for-data-evaluation-what-why-and-how-1e3j</link>
      <guid>https://dev.to/tigerdata/postgresql-vs-python-for-data-evaluation-what-why-and-how-1e3j</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;SQL basics&lt;/li&gt;
&lt;li&gt;A quick note on the data&lt;/li&gt;
&lt;li&gt;Evaluating the data&lt;/li&gt;
&lt;li&gt;Wrap up&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As I started writing this post, I realized that to properly show how to evaluate, clean, and transform data in the database (also known as data munging), I needed to focus on each step individually. This blog post will show you exactly how to use TimescaleDB and PostgreSQL to perform your &lt;strong&gt;data evaluation&lt;/strong&gt; tasks that you may have previously done in Excel, R, or Python. TimescaleDB and PostgreSQL cannot replace these tools entirely, but they can help your data munging/evaluation tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.  &lt;/p&gt;

&lt;p&gt;You may be asking yourself, “What exactly do you mean by evaluating the data?” When I talk about evaluating the data, I mean &lt;em&gt;really&lt;/em&gt; understanding the data set you are working with. &lt;/p&gt;

&lt;p&gt;If - in a theoretical world - I could grab a beer with my data set and talk to it about everything, that is what I would do during the evaluating step of my data analysis process. Before beginning analysis, I want to know every column, every general trend, every connection between tables, etc. To do this, I have to sit down and run query after query to get a solid picture of my data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recap
&lt;/h3&gt;

&lt;p&gt;If you remember, &lt;a href="https://blog.timescale.com/blog/speeding-up-data-analysis/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=previous-blog-post" rel="noopener noreferrer"&gt;in my last post&lt;/a&gt;, I summarized the analysis process as the “data analysis lifecycle” with the following steps: Evaluate, Clean, Transform, and Model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrl4941fhg2mr8zdwyox.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrl4941fhg2mr8zdwyox.jpeg" alt=" Image showing Evaluate -&amp;gt; Clean -&amp;gt; Transform -&amp;gt; Model, accompanied by icons which relate to each step" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a data analyst, I found that all the tasks I performed could be grouped into these four categories, with evaluating the data as the first and, I feel, most crucial step in the process. &lt;/p&gt;

&lt;p&gt;My first job out of college was at an energy and sustainability solutions company that focused on monitoring all different kinds of usage - such as electricity, water, sewage, you name it - to figure out how buildings could be more efficient. They would place sensors on whatever medium you wanted to monitor to help you figure out what initiatives your group could take to be more sustainable and ultimately save costs. My role at this company was to perform data analysis and business intelligence tasks.&lt;/p&gt;

&lt;p&gt;Throughout my time in this job, I got the chance to use many popular tools to evaluate my data, including Excel, R, Python, and heck, even Minitab. But once I tried using a database - and specifically PostgreSQL and TimescaleDB - I realized how efficient and straightforward evaluation work could be when done directly in a database. Operations that took me a while to hunt down online and figure out how to accomplish with pandas could be written intuitively in SQL. Plus, the database queries were just as fast as, if not faster than, my other code most of the time. &lt;/p&gt;

&lt;p&gt;Now, while I would love to show you a one-to-one comparison of my SQL code against each of these popular tools, that’s not practical. Besides, no one wants to read three examples of the same thing in a row! Thus, for comparison purposes in this blog post, I will directly show TimescaleDB and PostgreSQL functionality against Python code. Keep in mind that almost all code will likely be comparable to your Excel and R code. However, if you have any questions, feel free to hop on and join our &lt;a href="https://slack.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=slack" rel="noopener noreferrer"&gt;Slack channel&lt;/a&gt;, where you can ask the Timescale community, or me, specifics on TimescaleDB or PostgreSQL functionality 😊. I’d love to hear from you!&lt;/p&gt;

&lt;p&gt;Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away! Which is awesome! If so, you can &lt;a href="https://www.timescale.com/timescale-signup/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-sign-up" rel="noopener noreferrer"&gt;sign up for a free 30-day trial&lt;/a&gt; or &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=self-hosted" rel="noopener noreferrer"&gt;install and manage TimescaleDB on your own PostgreSQL instances&lt;/a&gt;. (You can also learn more by &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=tutorials" rel="noopener noreferrer"&gt;following one of our many tutorials&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;But enough of an intro, let’s get into the good stuff!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvhh8lgdxkcsf1dvvivz.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvhh8lgdxkcsf1dvvivz.gif" alt="Schitts Creek gif: Let's do it" width="480" height="480"&gt;&lt;/a&gt; &lt;/p&gt;
&lt;center&gt;&lt;a href="https://giphy.com/gifs/cbc-schitts-creek-QvwCVnX9DWdlHCnix5" rel="noopener noreferrer"&gt;via GIPHY&lt;/a&gt;&lt;/center&gt;
&lt;h2&gt;
  
  
  SQL basics
&lt;/h2&gt;

&lt;p&gt;PostgreSQL is a database platform that uses SQL syntax to interact with the data inside it. TimescaleDB is an extension that is applied to a PostgreSQL database. To unlock the potential of PostgreSQL and TimescaleDB, you have to use SQL. So, before we jump into things, I wanted to give a basic SQL syntax refresher. If you are familiar with SQL, please feel free to skip this section!&lt;/p&gt;

&lt;p&gt;For those of you who are newer to SQL (short for structured query language), it is the language many relational databases, including PostgreSQL, use to query data. Like &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html" rel="noopener noreferrer"&gt;pandas’ DataFrames&lt;/a&gt; or Excel’s spreadsheets, data queried with SQL is structured as a table with columns and rows.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.postgresql.org/docs/current/sql-select.html" rel="noopener noreferrer"&gt;basics of a SQL SELECT command&lt;/a&gt; can be broken down like this 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="c1"&gt;--columns, functions, aggregates, expressions that describe what you want to be shown in the results&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="c1"&gt;--if selecting data from a table in your DB, you must define the table name here&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="c1"&gt;--join another table to the FROM statement table &lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="c1"&gt;--a column that each table shares values &lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="c1"&gt;--statement to filter results where a column or expression is equivalent to some statement&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="c1"&gt;--if SELECT or WHERE statement contains an aggregate, or if you want to group values on a column/expression, must include columns here&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="c1"&gt;--similar to WHERE, this keyword helps to filter results based upon columns or expressions specifically used with a GROUP BY query&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="c1"&gt;--allows you to specify the order in which your data is displayed&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="c1"&gt;--lets you specify the number of rows you want displayed in the output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can think of your queries as &lt;em&gt;SELECTing&lt;/em&gt; data &lt;em&gt;FROM&lt;/em&gt; your tables within your database. You can &lt;em&gt;JOIN&lt;/em&gt; multiple tables together and specify &lt;em&gt;WHERE&lt;/em&gt; your data needs to be filtered or what it should be &lt;em&gt;GROUPed BY&lt;/em&gt;. Do you see what I did there 😋?&lt;/p&gt;

&lt;p&gt;This is the beauty of SQL; these keywords’ names were chosen to make your queries intuitive. Thankfully, most PostgreSQL and SQL functionality follows this same easy-to-read pattern. I have had the opportunity to teach myself many programming languages throughout my career, and SQL is by far the easiest to read, write, and construct. This intuitive nature is another excellent reason why data munging in PostgreSQL and TimescaleDB can be so efficient compared to other methods. &lt;/p&gt;

&lt;p&gt;Note that this list of keywords includes most of the ones you will need to start selecting data with SQL; however, it is not exhaustive. You will not need to use all these phrases for every query but likely will need at least &lt;code&gt;SELECT&lt;/code&gt; and &lt;code&gt;FROM&lt;/code&gt;. The queries in this blog post will always include these two keywords.&lt;/p&gt;

&lt;p&gt;Additionally, the order of these keywords is specific. When building your queries, you need to follow the order that I used above. For any additional PostgreSQL commands you wish to use, you will have to research where they fit in the order hierarchy and follow that accordingly. &lt;/p&gt;

&lt;p&gt;Seeing a list of commands may be somewhat helpful but is likely not enough to solidify understanding if you are like me. So let’s look at some examples!&lt;/p&gt;

&lt;p&gt;Let’s say that I have a table in my PostgreSQL database called &lt;code&gt;energy_usage&lt;/code&gt;. This table contains three columns: &lt;code&gt;time&lt;/code&gt;, which contains timestamp values; &lt;code&gt;energy&lt;/code&gt;, which contains numeric values; and &lt;code&gt;notes&lt;/code&gt;, which contains string values. As you may be able to imagine, every row of data in my &lt;code&gt;energy_usage&lt;/code&gt; table will contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;time&lt;/code&gt;: timestamp value saying when the reading was collected&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;energy&lt;/code&gt;: numeric value representing how much energy was used since the last reading&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;notes&lt;/code&gt;: string value giving additional context to each reading. &lt;/li&gt;
&lt;/ul&gt;
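&lt;p&gt;If you want to experiment with these queries without a PostgreSQL instance handy, the following sketch uses Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module as a stand-in (SQLite, not PostgreSQL, and the rows are made up) just to make the &lt;code&gt;energy_usage&lt;/code&gt; example concrete:&lt;/p&gt;

```python
import sqlite3

# A SQLite stand-in (not PostgreSQL); the schema mirrors the energy_usage
# table described above, and the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE energy_usage (time TEXT, energy REAL, notes TEXT)")
conn.executemany(
    "INSERT INTO energy_usage VALUES (?, ?, ?)",
    [
        ("2021-01-01 00:15:00", 0.02, "baseline"),
        ("2021-01-01 00:00:00", 0.01, "baseline"),
        ("2021-01-01 00:30:00", 0.03, "spike"),
    ],
)

# SELECT every column, earliest readings first.
rows = conn.execute(
    "SELECT time, energy, notes FROM energy_usage ORDER BY time ASC"
).fetchall()
for row in rows:
    print(row)
```

&lt;p&gt;The basic &lt;code&gt;SELECT&lt;/code&gt; statements in this section run unchanged against this toy table.&lt;/p&gt;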

&lt;p&gt;If I wanted to look at all the data within the table, I could use the following SQL query&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="c1"&gt;--I list my columns here&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="c1"&gt;-- I list my table here and end query with semi-colon&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, SQL has a shorthand for ‘include all columns’, the operator &lt;code&gt;*&lt;/code&gt;. So I could select all the data using this query as well,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What if I want to select the data and order it by the &lt;code&gt;time&lt;/code&gt; column so that the earliest readings are first and the latest are last? All I need to do is include the &lt;code&gt;ORDER BY&lt;/code&gt; statement and then specify the &lt;code&gt;time&lt;/code&gt; column along with the specification &lt;code&gt;ASC&lt;/code&gt; to let the database know I want the data in ascending order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="c1"&gt;-- first I list my time column then I specify either DESC or ASC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hopefully, you can start to see the pattern and feel more comfortable with SQL syntax. I will show many more code snippets throughout the post, so hang tight if you still need more examples!&lt;br&gt;
Now that we have had a little refresher on SQL basics, let’s jump into how you can use this language, along with TimescaleDB and PostgreSQL functionality, to evaluate your data!&lt;/p&gt;
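&lt;p&gt;If you want to experiment with these basics before connecting to a PostgreSQL instance, the same style of queries runs against SQLite from Python. This is a minimal sketch with a made-up &lt;code&gt;energy_usage&lt;/code&gt; table, purely for illustration:&lt;/p&gt;

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (illustration only);
# the energy_usage table and its rows are made up to mirror the columns above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE energy_usage (time TEXT, energy REAL, notes TEXT)")
cur.executemany(
    "INSERT INTO energy_usage VALUES (?, ?, ?)",
    [
        ("2016-01-06 02:00:00", 1.0, "weekday"),
        ("2016-01-06 01:00:00", 1.0, "weekday"),
        ("2016-01-06 03:00:00", 0.5, "weekday"),
    ],
)

# SELECT with an explicit column list
rows = cur.execute("SELECT time, energy, notes FROM energy_usage").fetchall()

# SELECT * with ORDER BY ... ASC puts the earliest readings first
ordered = cur.execute("SELECT * FROM energy_usage ORDER BY time ASC").fetchall()
conn.close()
```

&lt;p&gt;Because ISO-formatted timestamps sort lexicographically, &lt;code&gt;ORDER BY time ASC&lt;/code&gt; returns the 01:00:00 reading first.&lt;/p&gt;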
&lt;h2&gt;
  
  
  A quick note on the data
&lt;/h2&gt;

&lt;p&gt;Earlier I talked about my first job as a data analyst for an IoT sustainability company. Because of this job, I tend to love IoT data sets and couldn’t pass up the chance to explore &lt;a href="https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries" rel="noopener noreferrer"&gt;this IoT dataset from Kaggle&lt;/a&gt; to show how to perform data munging tasks in PostgreSQL and TimescaleDB. &lt;/p&gt;

&lt;p&gt;The data set contains two tables, one specifying energy consumption for a single home in Houston, Texas (called &lt;code&gt;power_usage&lt;/code&gt;), and the other documenting weather conditions (called &lt;code&gt;weather&lt;/code&gt;). This data is actually the same data set that I used in my previous post, so bonus points if you caught that 😊!&lt;/p&gt;

&lt;p&gt;This data was recorded from January 2016 to December 2020. While looking at this data set - as with any time-series data set - we must consider outside influences that could affect the data. The most obvious factor impacting analysis of this dataset is the COVID-19 pandemic, which overlaps with the data collected in 2020. Thankfully, we will see that the individual recording this data included notes to help categorize days affected by the pandemic. As I go through this blog series, we will see patterns associated with the data collected during the COVID-19 pandemic, so definitely keep this fact in the back of your mind as we perform various data munging and analysis steps!&lt;/p&gt;

&lt;p&gt;Here is an image explaining the two tables, their column names in red and corresponding data types in blue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F313g4nalb4viaenwccq9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F313g4nalb4viaenwccq9.jpg" alt="explanation of the the power_usage table and weather table. The power table has four columns: startdate (timestamp), value_kWh (numeric), day_of_week (int),  notes (varchar). Weather has date (date), day (int), temp_max (numeric), temp_avg (numeric), temp_min (numeric), dew_max (numeric), dew_avg (numeric), dew_min (numeric), hum_max   (numeric), hum_avg  (numeric), hum_min  (numeric), wind_max (numeric), wind_avg (numeric), wind_min (numeric), press_max (numeric), press_avg (numeric), press_min (numeric), precipit (numeric), day_of_week (int)" width="800" height="795"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we work through this blog post, we will use the evaluating techniques available within PostgreSQL and TimescaleDB to understand these two tables inside and out.&lt;/p&gt;
&lt;h2&gt;
  
  
  Evaluating the data
&lt;/h2&gt;

&lt;p&gt;As we discussed before, the first step in the data analysis lifecycle - and arguably the most critical step -  is to evaluate the data. I will go through how I would approach evaluating this IoT energy data, showing most of the techniques I have used in the past while working in data science. While these examples are not exhaustive, they will cover many of the evaluating steps you perform during your analysis, helping to make your evaluating tasks more efficient by using PostgreSQL and TimescaleDB. &lt;/p&gt;

&lt;p&gt;The techniques that I will cover include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading the raw data&lt;/li&gt;
&lt;li&gt;Finding and observing “categorical” column values in my dataset&lt;/li&gt;
&lt;li&gt;Sorting my data by specific columns&lt;/li&gt;
&lt;li&gt;Displaying grouped data&lt;/li&gt;
&lt;li&gt;Finding abnormalities in the database&lt;/li&gt;
&lt;li&gt;Looking at general trends&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Reading the raw data
&lt;/h3&gt;

&lt;p&gt;Let’s start with the simplest evaluation task: looking at the raw data.&lt;/p&gt;

&lt;p&gt;As we learned in the SQL refresher above, we can quickly pull all the data within a table by using the &lt;code&gt;SELECT&lt;/code&gt; statement with the &lt;code&gt;*&lt;/code&gt; operator. Since I have two tables within my database, I will query both tables’ data by running a query for each.&lt;/p&gt;

&lt;p&gt;PostgreSQL code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- select all the data from my power_usage table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;span class="c1"&gt;-- selects all the data from my weather table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But what if I don’t necessarily need to query all my data? Since all the data is housed in the database, if I want to get a feel for the data and the column values, I could just look at a snapshot of the raw data. &lt;/p&gt;

&lt;p&gt;While conducting analysis in Python, I often would just print a handful of rows of data to get a feel for the values. We can do this in PostgreSQL by including the &lt;code&gt;LIMIT&lt;/code&gt; command within our query. To show the first 20 rows of data in my tables, I can do the following:&lt;/p&gt;

&lt;p&gt;PostgreSQL code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- select all the data from my power_usage table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- specify 20 because I only want to see 20 rows of data&lt;/span&gt;
&lt;span class="c1"&gt;-- selects all the data from my weather table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; 
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results: Some of the rows for each table&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;startdate&lt;/th&gt;
&lt;th&gt;value_kwh&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 01:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 02:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 03:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 04:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 05:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 06:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;day&lt;/th&gt;
&lt;th&gt;temp_max&lt;/th&gt;
&lt;th&gt;temp_avg&lt;/th&gt;
&lt;th&gt;temp_min&lt;/th&gt;
&lt;th&gt;dew_max&lt;/th&gt;
&lt;th&gt;dew_avg&lt;/th&gt;
&lt;th&gt;dew_min&lt;/th&gt;
&lt;th&gt;hum_max&lt;/th&gt;
&lt;th&gt;hum_avg&lt;/th&gt;
&lt;th&gt;hum_min&lt;/th&gt;
&lt;th&gt;wind_max&lt;/th&gt;
&lt;th&gt;wind_avg&lt;/th&gt;
&lt;th&gt;wind_min&lt;/th&gt;
&lt;th&gt;press_max&lt;/th&gt;
&lt;th&gt;press_avg&lt;/th&gt;
&lt;th&gt;press_min&lt;/th&gt;
&lt;th&gt;precipit&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-06&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-07&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-08&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-09&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-10&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;td&gt;58&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-03-06&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python code:&lt;/p&gt;

&lt;p&gt;In this first Python code snippet, I show the modules I needed to import and the connection code that I would have to run to access the data from my database and import it into a pandas DataFrame. &lt;/p&gt;

&lt;p&gt;One of the challenges I faced while data munging in Python was the need to run through the entire script again and again when evaluating, cleaning, and transforming the data. This initial data-pulling process usually takes a good bit of time, so running it repeatedly was often frustrating. I also had to call print anytime I wanted to quickly glance at an array, DataFrame, or element. These extra tasks in Python can be time-consuming, especially if you end up at the modeling stage of the analysis lifecycle with only a subset of the original data! All this to say: for the other code snippets within this blog, I will not include this connection code, but it still runs in the background. &lt;/p&gt;

&lt;p&gt;Additionally, because I have my data housed in a TimescaleDB instance, I still need to use the &lt;code&gt;SELECT&lt;/code&gt; statement to query the data from the database and read it into Python. If you use a relational database - which I explained is very beneficial to analysis in my previous post - you will have to use &lt;em&gt;some&lt;/em&gt; SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tempfile&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;## use config file for database connection information
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConfigParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;env.ini&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## establish conntection
&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DB_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HOST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USER&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PASS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PORT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="c1"&gt;## define the queries for copying data out of our database (using format to copy queries)                    
&lt;/span&gt;&lt;span class="n"&gt;query_weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;select * from weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;query_power&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;select * from power_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;## define function to copy the data to a csv
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;copy_from_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tempfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TemporaryFile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tmpfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;copy_sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;COPY ({query}) TO STDOUT WITH CSV {head}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HEADER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy_expert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;copy_sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmpfile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tmpfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmpfile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;span class="c1"&gt;## create cursor to use in function above and place data into a file
&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;weather_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;copy_from_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;power_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;copy_from_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_power&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
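&lt;p&gt;As a side note, for smaller result sets pandas can collapse the query-and-read step into a single call with &lt;code&gt;pandas.read_sql&lt;/code&gt;. Here is a minimal sketch; I demonstrate it against an in-memory SQLite database with made-up &lt;code&gt;power_usage&lt;/code&gt; rows purely for illustration (with TimescaleDB you would pass the &lt;code&gt;psycopg2&lt;/code&gt; connection instead):&lt;/p&gt;

```python
import sqlite3

import pandas as pd

# Stand-in database; with TimescaleDB, pass the psycopg2 connection instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE power_usage (startdate TEXT, value_kwh REAL)")
conn.executemany(
    "INSERT INTO power_usage VALUES (?, ?)",
    [("2016-01-06 01:00:00", 1.0), ("2016-01-06 02:00:00", 1.0)],
)

# One call: run the query and load the result into a DataFrame
power_df = pd.read_sql("SELECT * FROM power_usage", conn)
conn.close()
```

&lt;p&gt;The trade-off is that &lt;code&gt;read_sql&lt;/code&gt; fetches rows through the driver and is typically slower than the &lt;code&gt;COPY&lt;/code&gt;-based approach above for large tables.&lt;/p&gt;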



&lt;h3&gt;
  
  
  Finding and observing “categorical” column values in my dataset
&lt;/h3&gt;

&lt;p&gt;Next, I think it is essential to understand any “categorical” columns - columns with a finite set of values - that I might have. This is useful in analysis because categorical data can give insight into natural groupings that often occur within a dataset. For example, I would assume that energy usage for many people is different on a weekday vs. a weekend. We can’t verify this without knowing the categorical possibilities and seeing how each could impact the data trend. &lt;/p&gt;

&lt;p&gt;First, I want to look at my tables and the data types used for each column. Looking at the available columns in each table, I can make an educated guess that the &lt;code&gt;day_of_week&lt;/code&gt;, &lt;code&gt;notes&lt;/code&gt;, and &lt;code&gt;day&lt;/code&gt; columns will be categorical. Let’s find out if they indeed are and how many different values exist in each. &lt;/p&gt;

&lt;p&gt;To find all the distinct values within a column (or between multiple columns), you can use the &lt;code&gt;DISTINCT&lt;/code&gt; keyword after &lt;code&gt;SELECT&lt;/code&gt; in your query statement. This can be useful for several data munging tasks, such as identifying categories - which I need to do - or finding unique sets of data. &lt;/p&gt;

&lt;p&gt;Since I want to look at the unique values within each column individually, I will run a query for each separately. If I were to run a query like this 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I would get data like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;vacation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;vacation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;vacation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;vacation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The output data would show unique &lt;em&gt;pairs&lt;/em&gt; of &lt;code&gt;day_of_week&lt;/code&gt; and &lt;code&gt;notes&lt;/code&gt; values within the table. This is why I query each column in its own statement: I want each individual column’s unique values, not the unique combinations of values. &lt;/p&gt;
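&lt;p&gt;To make the difference concrete, here is a small sketch - again against an in-memory SQLite database with made-up rows, purely for illustration - showing that &lt;code&gt;DISTINCT&lt;/code&gt; over two columns returns unique pairs, while &lt;code&gt;DISTINCT&lt;/code&gt; over one column returns only that column’s unique values:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE power_usage (day_of_week INTEGER, notes TEXT)")
cur.executemany(
    "INSERT INTO power_usage VALUES (?, ?)",
    [(3, "vacation"), (3, "weekday"), (1, "weekday"), (1, "weekday")],
)

# DISTINCT across two columns: unique (day_of_week, notes) pairs
pairs = cur.execute(
    "SELECT DISTINCT day_of_week, notes FROM power_usage"
).fetchall()

# DISTINCT on a single column: just that column's unique values
days = cur.execute(
    "SELECT DISTINCT day_of_week FROM power_usage ORDER BY day_of_week ASC"
).fetchall()
conn.close()
```

&lt;p&gt;Four rows produce three distinct pairs but only two distinct &lt;code&gt;day_of_week&lt;/code&gt; values, which is exactly why I query each column separately.&lt;/p&gt;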

&lt;p&gt;For these queries, I am also going to include the &lt;code&gt;ORDER BY&lt;/code&gt; command to show the values of each column in ascending order.&lt;/p&gt;

&lt;p&gt;PostgreSQL code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- selecting distinct values in the ‘day_of_week’ column within my power_usage table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- selecting distinct values in the ‘notes’ column within my power_usage table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- selecting distinct values in the ‘day’ column within my weather table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="nv"&gt;"day"&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nv"&gt;"day"&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- selecting distinct values in the ‘day_of_week’ column within my weather table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;p&gt;Notice that the person recording this data included “COVID_lockdown” as a category in the &lt;code&gt;notes&lt;/code&gt; column. As mentioned above, this note could be essential for finding and understanding patterns in this family's energy usage.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;COVID_lockdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vacation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(Only some of the values are shown for &lt;code&gt;day&lt;/code&gt;)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python code:&lt;br&gt;
In my Python code, notice that I need to print anything that I want to quickly observe. I have found this to be the quickest solution, even when compared to using the Python console in debug mode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;p_day_of_the_week&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;p_notes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;w_day&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;w_day_of_the_week&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_day_of_the_week&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_notes&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w_day&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w_day_of_the_week&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sorting my data by specific columns &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;What if I want to evaluate my tables based on how specific columns were sorted? One of the top questions asked on StackOverflow for Python data analysis is &lt;a href="https://stackoverflow.com/questions/17141558/how-to-sort-a-dataframe-in-python-pandas-by-two-or-more-columns" rel="noopener noreferrer"&gt;"How to sort a dataframe in python pandas by two or more columns?"&lt;/a&gt;. Once again, we can do this intuitively through SQL.&lt;/p&gt;

&lt;p&gt;One of the things I'm interested in identifying is how bad weather impacts energy usage. To do this, I have to think about indicators that typically signal bad weather, which include high precipitation, high wind speed, and low pressure. To identify days with this pattern in my PostgreSQL &lt;code&gt;weather&lt;/code&gt; table, I need to use the &lt;code&gt;ORDER BY&lt;/code&gt; keyword, then call out each column in the order I want things sorted, specifying the &lt;code&gt;DESC&lt;/code&gt; and &lt;code&gt;ASC&lt;/code&gt; attributes as needed. &lt;/p&gt;

&lt;p&gt;PostgreSQL code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- sort weather data by precipitation desc first, wind_avg desc second, and pressure asc third&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nv"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;precipit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wind_avg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;press_avg&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;precipit&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wind_avg&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;press_avg&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;precipit&lt;/th&gt;
&lt;th&gt;wind_avg&lt;/th&gt;
&lt;th&gt;press_avg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2017-08-27&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-08-28&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-09-20&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-08-08&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-08-29&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-08-12&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-06&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-05-07&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-10-05&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-03-29&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-03-06&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-06-19&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-08-05&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-10-30&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python code:&lt;/p&gt;

&lt;p&gt;I have often found the different pandas or Python functions to be harder to know off the top of my head. With how popular the StackOverflow question is, I can imagine that many of you also had to refer to Google for how to do this initially.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sorted_weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precipit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wind_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;press_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precipit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wind_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;press_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted_weather&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
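&lt;p&gt;One caveat worth knowing if your weather data has gaps: pandas places &lt;code&gt;NaN&lt;/code&gt; rows last by default, regardless of sort direction, while PostgreSQL treats &lt;code&gt;NULL&lt;/code&gt;s as largest by default. Here is a minimal sketch using the &lt;code&gt;na_position&lt;/code&gt; parameter of &lt;code&gt;sort_values()&lt;/code&gt;; the sample rows are hypothetical stand-ins for the real &lt;code&gt;weather_df&lt;/code&gt;.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real weather_df
weather_df = pd.DataFrame({
    'date': ['2017-08-27', '2017-08-28', '2019-09-20'],
    'precipit': [13.0, np.nan, 9.0],
    'wind_avg': [15, 24, 9],
    'press_avg': [30, 30, 30],
})

# pandas puts NaN rows last by default regardless of sort direction;
# na_position='first' surfaces them instead, so missing readings are easy to spot
sorted_weather = weather_df.sort_values(
    ['precipit', 'wind_avg', 'press_avg'],
    ascending=[False, False, True],
    na_position='first',
)
print(sorted_weather)
```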



&lt;h3&gt;
  
  
  Displaying grouped data &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Finding the sum of energy usage from data that records energy per hour can be instrumental in understanding data patterns. This concept boils down to performing a type of aggregation over a particular column. Between PostgreSQL and TimescaleDB, we have access to almost every type of aggregation function we could need. I will show some of these functions in this blog series, but I strongly encourage all of you to &lt;a href="https://www.postgresql.org/docs/current/functions-aggregate.html" rel="noopener noreferrer"&gt;look up more&lt;/a&gt; for your own use!&lt;/p&gt;

&lt;p&gt;From the categorical section earlier, I mentioned that I suspect people could have different energy behavior patterns on weekdays vs. weekends, particularly in a single-family home in the US. Given my data set, I’m curious about this hypothesis and want to find the cumulative energy consumption across each day of the week. &lt;/p&gt;

&lt;p&gt;To do so, I need to sum all the kWh data (&lt;code&gt;value_kwh&lt;/code&gt;) in the power table, then group this data by the day of the week (&lt;code&gt;day_of_week&lt;/code&gt;). In order to sum my data in PostgreSQL, I will use the &lt;code&gt;SUM()&lt;/code&gt; function. Because this is an aggregation function, I will have to include something that tells the database what to sum over. Since I want to know the sum of energy over each type of day, I can specify that the sum should be grouped by the &lt;code&gt;day_of_week&lt;/code&gt; column using the &lt;code&gt;GROUP BY&lt;/code&gt; keyword. I also added the &lt;code&gt;ORDER BY&lt;/code&gt; keyword so that we could look at the weekly summed usage in order of the day. &lt;/p&gt;

&lt;p&gt;PostgreSQL code: &lt;a&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- first I select the day_of_week col, then I define SUM(value_kwn) to get the sum of value_kwh col&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;--sum the value_sum column&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; &lt;span class="c1"&gt;-- group by the day_of_week col&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- decided to order data by the day_of_week asc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;p&gt;After some quick investigation, I found that the value &lt;code&gt;0&lt;/code&gt; in the &lt;code&gt;day_of_week&lt;/code&gt; column represents a Monday, so my hypothesis may just be right. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;sum&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3849&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3959&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3947&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4094&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3987&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;4169&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;4311&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python code:&lt;/p&gt;

&lt;p&gt;Something to note about the pandas &lt;code&gt;groupby()&lt;/code&gt; function is that the group by column in the DataFrame will become the index column in the resulting aggregated DataFrame. This can add some extra work later on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;day_agg_power&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value_kwh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sum&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;day_agg_power&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
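&lt;p&gt;To avoid the extra work mentioned above, you can keep the grouping key as an ordinary column instead of letting it become the index. A minimal sketch with hypothetical sample data standing in for the real &lt;code&gt;power_df&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample standing in for the real power_df
power_df = pd.DataFrame({
    'day_of_week': [0, 0, 1, 1],
    'value_kwh':   [2, 3, 4, 5],
})

# Option 1: as_index=False keeps day_of_week as a regular column from the start
day_agg_power = power_df.groupby('day_of_week', as_index=False).agg({'value_kwh': 'sum'})

# Option 2: reset_index() converts the index back into a column afterwards
day_agg_power_alt = power_df.groupby('day_of_week').agg({'value_kwh': 'sum'}).reset_index()

print(day_agg_power)
```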



&lt;h3&gt;
  
  
  Finding abnormalities in the database &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Clean data is fundamental in producing accurate analysis, and abnormalities/errors can be a huge roadblock to clean data. An essential part of evaluating data is finding abnormalities to determine if an error caused them. No data set is perfect, so it is vital to hunt down any possible errors in preparation for the cleaning stage of our analysis. Let's look at one example of how to uncover issues in a dataset using our example energy data.&lt;/p&gt;

&lt;p&gt;After looking at the raw data in my &lt;code&gt;power_usage&lt;/code&gt; table, I found that the &lt;code&gt;notes&lt;/code&gt; and &lt;code&gt;day_of_week&lt;/code&gt; columns &lt;strong&gt;should be the same for each hour across a single day&lt;/strong&gt; (there are 24 hourly readings each day, and each hour is supposed to have the same &lt;code&gt;notes&lt;/code&gt; value). In my experience with data analysis, I have found that notes which need to be recorded granularly often have mistakes within them. Because of this, I wanted to investigate whether or not this pattern was consistent across all of the data.&lt;/p&gt;

&lt;p&gt;To check this hypothesis I can use the TimescaleDB &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=time-bucket#time-bucket/" rel="noopener noreferrer"&gt;&lt;code&gt;time_bucket()&lt;/code&gt;&lt;/a&gt; function, PostgreSQL’s &lt;code&gt;GROUP BY&lt;/code&gt; keyword, and &lt;a href="https://www.postgresql.org/docs/13/queries-with.html" rel="noopener noreferrer"&gt;CTEs&lt;/a&gt; (common table expressions). While the &lt;code&gt;GROUP BY&lt;/code&gt; keyword is likely familiar to you by now, CTEs and the &lt;code&gt;time_bucket()&lt;/code&gt; function are not. So, before I show the query, let’s dive into these two features.&lt;/p&gt;

&lt;h4&gt;
  
  
  Time bucket function
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;time_bucket()&lt;/code&gt; function allows you to take a timestamp column like &lt;code&gt;startdate&lt;/code&gt; in the &lt;code&gt;power_usage&lt;/code&gt; table, and “bucket” the time based on the interval of your choice. For example, &lt;code&gt;startdate&lt;/code&gt; is a timestamp column that shows values for each hour in a day. You could use the &lt;code&gt;time_bucket()&lt;/code&gt; function on this column to “bucket” the hourly data into daily data. &lt;/p&gt;

&lt;p&gt;Here is an image that shows how rows of the &lt;code&gt;startdate&lt;/code&gt; column are bucketed into one aggregate row with &lt;code&gt;time_bucket(‘1 day’, startdate)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1466njhe3eg504kacrus.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1466njhe3eg504kacrus.jpg" alt="Image showing how hourly data from 2016-01-01 00:00:00 - 2016-01-01 23:00:00 is bucketed to 2016-01-01 00:00:00 using the time_bucket() function " width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After using the &lt;code&gt;time_bucket()&lt;/code&gt; function in my query, I will have one unique “date” value for any data recorded over a single day. Since &lt;code&gt;notes&lt;/code&gt; and &lt;code&gt;day_of_week&lt;/code&gt; should also be unique over each day, if I &lt;em&gt;group by&lt;/em&gt; these columns, I should get a single set of (date, day_of_week, notes) values. &lt;/p&gt;

&lt;p&gt;Notice that to use &lt;code&gt;GROUP BY&lt;/code&gt; in this scenario, I just list the columns I want to group on. Also, notice that I added &lt;code&gt;AS&lt;/code&gt; after my &lt;code&gt;time_bucket()&lt;/code&gt; function; this keyword lets you "rename" columns. In the results, look for the &lt;code&gt;day&lt;/code&gt; column, as this comes directly from my rename. &lt;/p&gt;

&lt;p&gt;PostgreSQL code: &lt;a&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- select the date through time_bucket and get unique values for each &lt;/span&gt;
&lt;span class="c1"&gt;-- (date, day_of_week, notes) set&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startdate&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results: Some of the rows&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2017-01-19 00:00:00&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-06 00:00:00&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-06-04 00:00:00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-01-03 00:00:00&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-10-01 00:00:00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-11-27 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-06-15 00:00:00&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-11-16 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-05-18 00:00:00&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-07-17 00:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020-03-06 00:00:00&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-10-14 00:00:00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python code:&lt;/p&gt;

&lt;p&gt;In my Python code, I cannot simply manipulate the table to print results; I actually have to create another column in the DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;day_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;startdate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;power_unique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_unique&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
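&lt;p&gt;If you would rather keep the bucketed column as a real timestamp, which is closer to what &lt;code&gt;time_bucket()&lt;/code&gt; returns than the formatted string above, pandas offers &lt;code&gt;dt.floor('D')&lt;/code&gt;. A sketch with hypothetical rows standing in for the real &lt;code&gt;power_df&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample standing in for the real power_df
power_df = pd.DataFrame({
    'startdate': ['2016-01-01 00:00:00', '2016-01-01 13:00:00', '2016-01-02 05:00:00'],
    'day_of_week': [4, 4, 5],
    'notes': ['weekday', 'weekday', 'weekend'],
})

# dt.floor('D') truncates each timestamp to midnight while keeping datetime dtype,
# so later date arithmetic and joins still work without re-parsing strings
power_df['date_day'] = pd.to_datetime(power_df['startdate']).dt.floor('D')
power_unique = power_df[['date_day', 'day_of_week', 'notes']].drop_duplicates()
print(power_unique)
```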



&lt;p&gt;Now that we understand the &lt;code&gt;time_bucket()&lt;/code&gt; function a little better, let's look at CTEs and how they help me use this bucketed data to find any errors within the &lt;code&gt;notes&lt;/code&gt; column. &lt;/p&gt;

&lt;h4&gt;
  
  
  CTEs or common table expressions
&lt;/h4&gt;

&lt;p&gt;Getting unique sets of data only solves half of my problem. Now I want to verify that each day is truly mapped to a single &lt;code&gt;day_of_week&lt;/code&gt; and &lt;code&gt;notes&lt;/code&gt; pair. This is where CTEs come in handy. With CTEs, you can build a query based on the results of others. &lt;/p&gt;

&lt;p&gt;CTEs use the following format 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;query_1&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="c1"&gt;-- columns expressions&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="c1"&gt;--column expressions &lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;query_1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;WITH&lt;/code&gt; and &lt;code&gt;AS&lt;/code&gt; allow you to define the first query, then in the second &lt;code&gt;SELECT&lt;/code&gt; statement, you can call the results from the first query as if it were another table in the database. &lt;/p&gt;

&lt;p&gt;To check that each day was “mapped” to a single &lt;code&gt;day_of_week&lt;/code&gt; and &lt;code&gt;notes&lt;/code&gt; pair, I need to aggregate the queried &lt;code&gt;time_bucket()&lt;/code&gt; table above on the date column using another PostgreSQL aggregation function, &lt;code&gt;COUNT()&lt;/code&gt;. I am doing this because each day should contain only one unique &lt;code&gt;day_of_week&lt;/code&gt; and &lt;code&gt;notes&lt;/code&gt; pair. If the count is two or more, that day contains multiple &lt;code&gt;day_of_week&lt;/code&gt; and &lt;code&gt;notes&lt;/code&gt; pairs and is thus showing abnormal data. &lt;/p&gt;

&lt;p&gt;Additionally, I will add a &lt;code&gt;HAVING&lt;/code&gt; statement into my query so that the output only displays rows where the &lt;code&gt;COUNT(day)&lt;/code&gt; is greater than one. I will also throw in an &lt;code&gt;ORDER BY&lt;/code&gt; statement in case we have many different values greater than 1.&lt;/p&gt;

&lt;p&gt;PostgreSQL code: &lt;a&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;power_unique&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;-- query from above, get unique set of (date, day_of_week, notes)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startdate&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- calls data from the query above, using the COUNT() agg function&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_unique&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2017-12-27 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020-01-03 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-06-02 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-06-03 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020-07-01 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-07-21 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python code:&lt;/p&gt;

&lt;p&gt;Because of the count aggregation, I needed to rename the column in my &lt;code&gt;agg_power_unique&lt;/code&gt; DataFrame so that I could then sort the values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;day_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;startdate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;## If you ran the previous code snippet, this next line will error since you already ran it
&lt;/span&gt;&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;power_unique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;agg_power_unique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_unique&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;agg_power_unique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agg_power_unique&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agg_power_unique&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;agg_power_unique&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
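If you prefer a shorter pandas route, `value_counts()` collapses the rename-then-filter steps into one call. This is only a sketch: the tiny DataFrame below is hypothetical, standing in for `power_df` with the same three columns.

```python
import pandas as pd

# hypothetical sample standing in for power_df; the real table has one
# row per reading, so a duplicated date can hide behind distinct rows
power_df = pd.DataFrame({
    'date_day':    ['2019-06-03', '2019-06-03', '2020-07-01',
                    '2020-07-01', '2016-07-22'],
    'day_of_week': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
    'notes':       ['weekday'] * 5,
})

# drop duplicate (date, weekday, notes) rows first, then count the distinct
# rows left per date; value_counts() already sorts descending by count
counts = (power_df[['date_day', 'day_of_week', 'notes']]
          .drop_duplicates()['date_day']
          .value_counts())
print(counts[counts > 1])
```

The filter `counts > 1` surfaces exactly the dates that map to more than one distinct weekday/notes combination.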



&lt;p&gt;This query reveals that I indeed have several data points that seem suspicious: specifically, the dates [2017-12-27, 2020-01-03, 2018-06-02, 2019-06-03, 2020-07-01, 2016-07-21]. I will demonstrate how to fix these date issues in a later blog post about cleaning techniques.&lt;/p&gt;

&lt;p&gt;This example shows only one set of functions that helped me identify abnormal data through grouping and aggregation. You can use many other PostgreSQL and TimescaleDB functions to find other abnormalities in your data, such as TimescaleDB’s &lt;code&gt;approx_percentile()&lt;/code&gt; function (introduced next) to find outliers in numeric columns through interquartile range calculations.&lt;/p&gt;
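As a sketch of that interquartile-range idea in plain numpy terms (the kWh values below are hypothetical, and this uses the standard 1.5 × IQR fence rather than anything TimescaleDB-specific):

```python
import numpy as np

# hypothetical daily kWh sums; 73.0 is an obvious outlier
daily_kwh = np.array([4.0, 7.0, 16.0, 18.9, 22.0, 29.0, 39.0, 73.0])

# classic IQR fence: flag anything beyond 1.5 * IQR from the quartiles
q75, q25 = np.percentile(daily_kwh, [75, 25])
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
outliers = daily_kwh[(daily_kwh < lower) | (daily_kwh > upper)]
print(outliers)
```

The same fence can be expressed in SQL by substituting `approx_percentile(0.75, ...)` and `approx_percentile(0.25, ...)` into a `WHERE` clause.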

&lt;h4&gt;
  
  
  Looking at general trends
&lt;/h4&gt;

&lt;p&gt;Arguably, one of the more critical aspects of evaluating your data is understanding its general trends. To do this, you need basic statistics on your data, using functions like mean, interquartile range, and maximum value. Timescale has created many optimized hyperfunctions to perform these very tasks.&lt;/p&gt;

&lt;p&gt;To calculate these values, I am going to introduce the following TimescaleDB functions: &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=approx-percentile" rel="noopener noreferrer"&gt;&lt;code&gt;approx_percentile&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/min_val/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=min-val" rel="noopener noreferrer"&gt;&lt;code&gt;min_val&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/max_val/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=max-val" rel="noopener noreferrer"&gt;&lt;code&gt;max_val&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/mean/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=mean" rel="noopener noreferrer"&gt;&lt;code&gt;mean&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/num_vals/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=num-vals" rel="noopener noreferrer"&gt;&lt;code&gt;num_vals&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile_agg/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=percentile-agg" rel="noopener noreferrer"&gt;&lt;code&gt;percentile_agg&lt;/code&gt; (aggregate)&lt;/a&gt;, and &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/tdigest/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=tdigest" rel="noopener noreferrer"&gt;&lt;code&gt;tdigest&lt;/code&gt; (aggregate)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These hyperfunctions fall under TimescaleDB’s two-step aggregation pattern. Timescale designed each function to be either an aggregate or an accessor function (I noted above which ones are aggregates). In two-step aggregation, the more computationally taxing aggregate function is calculated first, and the accessor function is then applied to its result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feymhbffv4yst6sjv8vuf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feymhbffv4yst6sjv8vuf.png" alt="accessor_function(aggregate_function())" width="800" height="61"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For specifics on how two-step aggregation works and why we use this convention, check out &lt;a href="https://blog.timescale.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=hyperfunctions-blog" rel="noopener noreferrer"&gt;David Kohn’s blog series on our hyperfunctions and two-step aggregation&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;I definitely want to understand the basic trends within the &lt;code&gt;power_usage&lt;/code&gt; table for my data set. If I plan to do any type of modeling to predict future usage trends, I need to know some basic information about what this home’s usage looks like daily. &lt;/p&gt;

&lt;p&gt;To understand the daily power usage data distribution, I’ll need to aggregate the energy usage per day. To do this, I can use the &lt;code&gt;time_bucket()&lt;/code&gt; function I mentioned above, along with the &lt;code&gt;SUM()&lt;/code&gt; operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- bucket the daily data using time_bucket, sum kWh over each bucketed day&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startdate&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I then want to find the 1st, 10th, 25th, 75th, 90th, and 99th percentiles, the median (50th percentile), mean, minimum value, maximum value, number of readings in the table, and interquartile range of this data. Creating the query with a CTE simplifies the process by calculating the daily sums only once and reusing them multiple times.&lt;/p&gt;

&lt;p&gt;PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;power_usage_sum&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;-- bucket the daily data using time_bucket, sum kWh over each bucketed day&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startdate&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;sum_kwh&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- using two-step aggregation functions to find stats&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"1p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"10p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"25p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"50p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"75p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"90p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"99p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;min_val&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tdigest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="n"&gt;max_val&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tdigest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="n"&gt;num_vals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="c1"&gt;-- you can use subtraction to create an output for the IQR&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;iqr&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage_sum&lt;/span&gt; &lt;span class="n"&gt;pus&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;1p&lt;/th&gt;
&lt;th&gt;10p&lt;/th&gt;
&lt;th&gt;25p&lt;/th&gt;
&lt;th&gt;50p&lt;/th&gt;
&lt;th&gt;75p&lt;/th&gt;
&lt;th&gt;90p&lt;/th&gt;
&lt;th&gt;99p&lt;/th&gt;
&lt;th&gt;min_val&lt;/th&gt;
&lt;th&gt;max_val&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;num_vals&lt;/th&gt;
&lt;th&gt;iqr&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;4.0028&lt;/td&gt;
&lt;td&gt;6.9936&lt;/td&gt;
&lt;td&gt;16.0066&lt;/td&gt;
&lt;td&gt;28.9914&lt;/td&gt;
&lt;td&gt;38.9781&lt;/td&gt;
&lt;td&gt;56.9971&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;73.0&lt;/td&gt;
&lt;td&gt;18.9025&lt;/td&gt;
&lt;td&gt;1498.0&lt;/td&gt;
&lt;td&gt;21.9978&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python:&lt;/p&gt;

&lt;p&gt;Something that really stumped me when initially writing this code snippet was that I had to use &lt;code&gt;astype(float)&lt;/code&gt; on my &lt;code&gt;value_kwh&lt;/code&gt; column before I could use &lt;code&gt;describe()&lt;/code&gt;. I have probably spent the combined time of a day over my life dealing with value types being incompatible with certain functions. This is another reason why I enjoy data munging with the intuitive functionality of PostgreSQL and TimescaleDB: these types of problems just happen less often. And let me tell you, the faster and more painless the data munging is, the happier I am!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agg_power&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value_kwh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sum&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# need to make the value_kwh column the right data type
&lt;/span&gt;&lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;describe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;percentiles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;([.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;q75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q25&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value_kwh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;iqr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q75&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;q25&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iqr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another technique you may want to use for assessing the distribution of data in a column is a histogram. Generally, creating an image is where Python and other tools shine. However, I often need to glance at a histogram to check for any blatant anomalies when evaluating data. While this technique in TimescaleDB may not be as simple as the Python solution, I can still do it directly in my database, which can be convenient. &lt;/p&gt;

&lt;p&gt;To create a histogram in the database, we will need to use the TimescaleDB &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/histogram/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=histogram#required-arguments/" rel="noopener noreferrer"&gt;&lt;code&gt;histogram()&lt;/code&gt;&lt;/a&gt; function, &lt;a href="https://www.postgresql.org/docs/13/functions-array.html" rel="noopener noreferrer"&gt;&lt;code&gt;unnest()&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://www.postgresql.org/docs/13/functions-srf.html" rel="noopener noreferrer"&gt;&lt;code&gt;generate_series()&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://www.postgresql.org/docs/13/functions-string.html" rel="noopener noreferrer"&gt;&lt;code&gt;repeat()&lt;/code&gt;&lt;/a&gt;, and CTEs.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;histogram()&lt;/code&gt; function takes in the column you want to analyze and produces an array containing the frequency values for the specified number of buckets plus two (one extra bucket for values below the lowest bound and one for values above the highest). You can then use PostgreSQL’s &lt;code&gt;unnest()&lt;/code&gt; function to break the array into a single column, with one row per bucket. &lt;/p&gt;

&lt;p&gt;Once you have a column with bucket frequencies, you can then create a histogram “image” using the PostgreSQL &lt;code&gt;repeat()&lt;/code&gt; function. The first time I saw someone use the &lt;code&gt;repeat()&lt;/code&gt; function in this way was in &lt;a href="https://hakibenita.com/sql-for-data-analysis" rel="noopener noreferrer"&gt;Haki Benita’s blog post&lt;/a&gt;, which I recommend reading if you are interested in learning more PostgreSQL analytical techniques. The &lt;code&gt;repeat()&lt;/code&gt; function builds a string that repeats a chosen character a specified number of times; to visualize the histogram, you simply pass each unnested frequency value in as the repeat count. &lt;/p&gt;
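For comparison, the same text-histogram trick in Python is just string multiplication: `numpy.histogram` stands in for the `histogram()` hyperfunction and `'■' * n` stands in for `repeat()`. The values here are hypothetical, and note that numpy, unlike TimescaleDB, does not add the two out-of-range buckets.

```python
import numpy as np

# hypothetical daily kWh sums standing in for the bucketed query output
daily_kwh = np.array([2.0, 5.0, 5.5, 6.0, 11.0, 15.0, 15.5, 22.0, 28.0, 40.0])

# np.histogram returns (counts, bin_edges); values outside `range`
# would simply be dropped rather than collected in extra buckets
counts, edges = np.histogram(daily_kwh, bins=5, range=(0, 50))

# string multiplication plays the role of repeat('■', n)
for start, n in zip(edges[:-1], counts):
    print(f'{start:6.1f} | {"■" * n}')
```

This prints one row per bucket with its approximate starting value, much like the SQL result further down.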

&lt;p&gt;Additionally, I find it useful to know the approximate starting values for each bucket in the histogram. This gives me a better picture of which values are occurring where. To approximate the bin values, I use the PostgreSQL &lt;code&gt;generate_series()&lt;/code&gt; function along with some algebra:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;number_of_buckets&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;max_val&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;min_val&lt;/span&gt;&lt;span class="p"&gt;]::&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;number_of_buckets&lt;/span&gt;&lt;span class="p"&gt;]::&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;min_val&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
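To sanity-check that algebra, here is the same expression in Python, using the min value 0, max value 73, and 30 buckets from the histogram query; `generate_series(-1, n)` corresponds to `range(-1, n + 1)`.

```python
# mirror the SQL:
# (generate_series(-1, n) * (max - min)::float / n::float) + min
min_val, max_val, n_buckets = 0.0, 73.0, 30

starts = [i * (max_val - min_val) / n_buckets + min_val
          for i in range(-1, n_buckets + 1)]

# 32 entries in total: the first is the underflow bucket's start,
# the last is the overflow bucket's start
print(starts[0], starts[1], starts[2])
```

The first three values (about -2.433, 0.0, and 2.433) match the `approx_bucket_start_val` column in the results table below.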



&lt;p&gt;When I put all these techniques together, I can produce a histogram with the following query.&lt;/p&gt;

&lt;p&gt;PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;power_usage_sum&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;-- bucket the daily data using time_bucket, sum kWh over each bucketed day&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startdate&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;sum_kwh&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;histogram&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;-- I input the column = sum_kwh, the min value = 28, max value = 90, and number of buckets = 25&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage_sum&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; 
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
&lt;span class="c1"&gt;-- I use unnest to create the first column&lt;/span&gt;
   &lt;span class="k"&gt;unnest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="c1"&gt;-- I use my approximate bucket values function&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;approx_bucket_start_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="c1"&gt;-- I then use the repeat function to display the frequency&lt;/span&gt;
   &lt;span class="n"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'■'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;unnest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;frequency&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;th&gt;approx_bucket_start_val&lt;/th&gt;
&lt;th&gt;frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;-2.433333333333333&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;td&gt;2.433333333333333&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;207&lt;/td&gt;
&lt;td&gt;4.866666666666666&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;7.3&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;td&gt;9.733333333333333&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;12.166666666666666&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;14.6&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;17.033333333333335&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;19.466666666666665&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;td&gt;21.9&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;24.333333333333332&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;26.766666666666666&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;29.2&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;31.633333333333333&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;td&gt;34.06666666666667&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;36.5&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;38.93333333333333&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;41.36666666666667&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;43.8&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;46.233333333333334&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;48.666666666666664&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;51.1&lt;/td&gt;
&lt;td&gt;■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;53.53333333333333&lt;/td&gt;
&lt;td&gt;■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;55.96666666666667&lt;/td&gt;
&lt;td&gt;■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;58.4&lt;/td&gt;
&lt;td&gt;■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;60.833333333333336&lt;/td&gt;
&lt;td&gt;■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;63.266666666666666&lt;/td&gt;
&lt;td&gt;■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;65.7&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;68.13333333333334&lt;/td&gt;
&lt;td&gt;■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;70.56666666666666&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;73.0&lt;/td&gt;
&lt;td&gt;■&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python:&lt;/p&gt;

&lt;p&gt;This Python code is definitely simpler and relatively painless. I wanted to show this comparison to present the option of displaying a histogram directly in your database vs. pulling the data into a pandas DataFrame and then displaying it. Building the histogram in the database just helps me keep my focus while evaluating the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
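If you prefer to stay text-based on the Python side too, a closer analogue of the SQL text histogram above can be sketched with NumPy and pandas. This is only an illustration: the gamma-distributed sample stands in for the kWh readings, and the bucket count of 30 mirrors the SQL example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the value_kwh readings
values = pd.Series(np.random.default_rng(0).gamma(2.0, 5.0, 1000))

# Bin into 30 buckets, mirroring the 30-bucket histogram() call in SQL
counts, edges = np.histogram(values, bins=30)

# Print a text histogram like the repeat('■', ...) trick in PostgreSQL
for start, count in zip(edges[:-1], counts):
    print(f"{start:8.2f} | {'■' * count}")
```
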



&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;Hopefully, after reading through these various evaluation techniques, you feel more comfortable exploring some of the possibilities that PostgreSQL and TimescaleDB provide. Evaluating data directly in the database has often saved me time without sacrificing any functionality. If you are looking to save time and effort while evaluating your data for analysis, definitely consider using PostgreSQL and TimescaleDB. &lt;/p&gt;

&lt;p&gt;In my next posts, I will go over techniques to clean and transform data using PostgreSQL and TimescaleDB. I'll then take everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting deep-dive data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).&lt;/p&gt;

&lt;p&gt;If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our &lt;a href="https://slack.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=slack" rel="noopener noreferrer"&gt;community Slack&lt;/a&gt;, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).&lt;/p&gt;

&lt;p&gt;If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can &lt;a href="https://www.timescale.com/timescale-signup/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-sign-up" rel="noopener noreferrer"&gt;sign up for a free 30-day trial&lt;/a&gt; or &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=self-hosted" rel="noopener noreferrer"&gt;install TimescaleDB and manage it on your current PostgreSQL instances&lt;/a&gt;. We also have a bunch of &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=tutorials" rel="noopener noreferrer"&gt;great tutorials&lt;/a&gt; to help get you started.&lt;br&gt;
Until next time!&lt;/p&gt;

&lt;h2&gt;
  
  
  Functionality Glossary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SELECT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FROM&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ORDER BY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DESC&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ASC&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LIMIT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DISTINCT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GROUP BY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SUM()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time_bucket(&amp;lt;time_interval&amp;gt;, &amp;lt;time_col&amp;gt;)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;CTE’s &lt;code&gt;WITH&lt;/code&gt; &lt;code&gt;AS&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;COUNT()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;approx_percentile()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;min_val()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_val()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mean()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;num_vals()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;percentile_agg()&lt;/code&gt; [aggregate]&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tdigest()&lt;/code&gt; [aggregate]&lt;/li&gt;
&lt;li&gt;&lt;code&gt;histogram()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;unnest()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;generate_series()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;repeat()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>postgres</category>
      <category>datascience</category>
      <category>timescaledb</category>
      <category>timeseries</category>
    </item>
    <item>
      <title>Speeding up data analysis with TimescaleDB and PostgreSQL</title>
      <dc:creator>mirandaauhl</dc:creator>
      <pubDate>Fri, 17 Sep 2021 14:15:54 +0000</pubDate>
      <link>https://dev.to/tigerdata/speeding-up-data-analysis-with-timescaledb-and-postgresql-3nj</link>
      <guid>https://dev.to/tigerdata/speeding-up-data-analysis-with-timescaledb-and-postgresql-3nj</guid>
      <description>&lt;h1&gt;
  
  
  Table of contents
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Common data analysis tools and “the problem”&lt;/li&gt;
&lt;li&gt;Data analysis issue #1: storing and accessing data&lt;/li&gt;
&lt;li&gt;Data analysis issue #2: maximizing analysis speed and computation efficiency (the bigger the dataset, the bigger the problem)&lt;/li&gt;
&lt;li&gt;Data analysis issue #3: storing and maintaining scripts for data analysis&lt;/li&gt;
&lt;li&gt;Data analysis issue #4: easily utilizing new or additional technologies&lt;/li&gt;
&lt;li&gt;Wrapping up&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://blog.timescale.com/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=time-series-blog-post" rel="noopener noreferrer"&gt;Time-series&lt;/a&gt; data is everywhere, and it drives decision-making in every industry. Time-series data collectively represents how a system, process, or behavior changes over time. Understanding these changes helps us to solve complex problems across numerous industries, including &lt;a href="https://blog.timescale.com/blog/simplified-prometheus-monitoring-for-your-entire-organization-with-promscale/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=promscale-blog-post" rel="noopener noreferrer"&gt;observability&lt;/a&gt;, &lt;a href="https://blog.timescale.com/blog/how-messari-uses-data-to-open-the-cryptoeconomy-to-everyone/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=messari-blog-post" rel="noopener noreferrer"&gt;financial services&lt;/a&gt;, &lt;a href="https://blog.timescale.com/blog/how-meter-group-brings-a-data-driven-approach-to-the-cannabis-production-industry/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=meter-groups-cannabis-blog-post" rel="noopener noreferrer"&gt;Internet of Things&lt;/a&gt;, and even &lt;a href="https://blog.timescale.com/blog/hacking-nfl-data-with-postgresql-timescaledb-and-sql/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=nfl-blog-post" rel="noopener noreferrer"&gt;professional football&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Depending on the type of application they’re building, developers end up collecting millions of rows of time-series data (and sometimes millions of rows of data every day or even every hour!). Making sense of this high-volume, high-fidelity data takes a particular set of data analysis skills that aren’t often exercised as part of the classic developer skillset. To perform time-series analysis that goes beyond basic questions, developers and data analysts need specialized tools, and as &lt;a href="https://db-engines.com/en/ranking_categories" rel="noopener noreferrer"&gt;time-series data grows in prominence&lt;/a&gt;, the &lt;strong&gt;efficiency&lt;/strong&gt; of these tools becomes even more important.&lt;/p&gt;

&lt;p&gt;Often, data analysts’ work can be boiled down to &lt;strong&gt;evaluating&lt;/strong&gt;, &lt;strong&gt;cleaning&lt;/strong&gt;, &lt;strong&gt;transforming&lt;/strong&gt;, and &lt;strong&gt;modeling&lt;/strong&gt; data. In my experience, I’ve found these actions are necessary for me to gain understanding from data, and I will refer to this as the “data analysis life cycle” throughout this post.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjffib5zhl318233ti2y.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjffib5zhl318233ti2y.jpeg" alt="Graphic showing the “data analysis lifecycle”, Evaluate -&amp;gt; Clean -&amp;gt; Transform -&amp;gt; Model" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Excel, R, and Python are arguably some of the most commonly used data analysis tools, and, while they are all fantastic tools, they may not be suited for every job. Speaking from experience, these tools can be especially inefficient for “data munging” at the early stages of the lifecycle; specifically, the &lt;strong&gt;evaluating data&lt;/strong&gt;, &lt;strong&gt;cleaning data&lt;/strong&gt;, and &lt;strong&gt;transforming data&lt;/strong&gt; steps involved in pre-modeling work.&lt;/p&gt;

&lt;p&gt;As I’ve worked with larger and more complex datasets, I’ve come to believe that databases built for specific types of data - such as time-series data - are more effective for data analysis.&lt;/p&gt;

&lt;p&gt;For background, &lt;a href="https://www.timescale.com/products/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-products-page" rel="noopener noreferrer"&gt;TimescaleDB&lt;/a&gt; is a &lt;em&gt;relational&lt;/em&gt; database for time-series data. If your analysis is based on time-series datasets, TimescaleDB can be a great choice not only for its scalability and dependability but also for its relational nature. Because TimescaleDB is packaged as an extension to PostgreSQL, you’ll be able to look at your time-series data alongside your relational data and get even more insight. (I recognize that as a Developer Advocate at Timescale, I might be a &lt;em&gt;little&lt;/em&gt; biased 😊…)&lt;/p&gt;

&lt;p&gt;In this four-part blog series, I will discuss each of the three data munging steps in the analysis lifecycle in depth and demonstrate how to use TimescaleDB as a powerful tool for your data analysis.&lt;/p&gt;

&lt;p&gt;In this introductory post, I'll explore a few of the common frustrations that I experienced with popular data analysis tools, and from there, dive into how I’ve used TimescaleDB to help alleviate each of those pain points.&lt;/p&gt;

&lt;p&gt;In future posts we'll look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How TimescaleDB data analysis functionality can replace work commonly performed in Python and pandas&lt;/li&gt;
&lt;li&gt;How TimescaleDB vs. Python and pandas compare (benchmarking a standard data analysis workflow)&lt;/li&gt;
&lt;li&gt;How to use TimescaleDB to conduct an end-to-end, deep-dive data analysis, using real yellow taxi cab data from the &lt;a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page" rel="noopener noreferrer"&gt;New York City Taxi and Limousine Commission&lt;/a&gt; (NYC TLC).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are interested in trying out TimescaleDB and PostgreSQL functionality right away, &lt;a href="https://www.timescale.com/timescale-signup/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-signup" rel="noopener noreferrer"&gt;sign up for a free 30-day trial&lt;/a&gt; or &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-self-hosted-guide" rel="noopener noreferrer"&gt;install and manage it on your instances&lt;/a&gt;. (You can also learn more by &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-tutorials" rel="noopener noreferrer"&gt;following one of our many tutorials&lt;/a&gt;.)&lt;/p&gt;

&lt;h1&gt;
  
  
  Common data analysis tools and “the problem”
&lt;/h1&gt;

&lt;p&gt;As we’ve discussed, the three most popular tools used for data analysis are Excel, R, and Python. While they are great tools in their own right, they are not optimized to efficiently perform every step in the analysis process.&lt;/p&gt;

&lt;p&gt;In particular, most data scientists (including myself!) struggle with similar issues as the amount of data grows or the same analysis needs to be redone month after month.&lt;/p&gt;

&lt;p&gt;Some of these struggles include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data storage and access: Where is the best place to store and maintain my data for analysis?&lt;/li&gt;
&lt;li&gt;Data size and its influence on the analysis: How can I improve efficiency for data munging tasks, especially as data scales?&lt;/li&gt;
&lt;li&gt;Script storage and accessibility: What can I do to improve data munging script storage and maintenance?&lt;/li&gt;
&lt;li&gt;Easily utilizing new technologies: How could I set up my data analysis toolchain to allow for easy transitions to new technologies?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So buckle in, keep your arms and legs in the vehicle at all times, and let’s start looking at these problems!&lt;/p&gt;

&lt;h1&gt;
  
  
  Data analysis issue #1: storing and accessing data
&lt;/h1&gt;


&lt;p&gt;To do data analysis, you need access to… data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faugwnodx7y3h0ucy5rv2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faugwnodx7y3h0ucy5rv2.gif" alt="Image of Data from Star Trek smiling" width="480" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;
        &lt;a href="https://giphy.com/gifs/rIq6ASPIqo2k0" rel="noopener noreferrer"&gt;via GIPHY&lt;/a&gt;
&lt;/center&gt;

&lt;p&gt;Managing where that data lives and how easily you can access it is the preliminary (and often most important) step in the analysis journey. Every time I begin a new data analysis project, this is often where I run into my first dilemma. Regardless of the original data source, I always ask “where is the best place to store and maintain the data as I start working through the data munging process?”&lt;/p&gt;

&lt;p&gt;Although it's becoming more common for data analysts to use databases for storing and querying data, it's still not ubiquitous. Too often, raw data is provided in a stream of CSV files or APIs that produce JSON. While this may be manageable for smaller projects, it can quickly become overwhelming to maintain and difficult to manage from project to project.&lt;/p&gt;

&lt;p&gt;For example, let’s consider how we might use Python as our data analysis tool of choice.&lt;/p&gt;

&lt;p&gt;While using Python for data analysis, I have the option of ingesting data through files/APIs OR a database connection.&lt;/p&gt;

&lt;p&gt;If I used files or APIs for querying data during analysis, I often faced questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where are the files located? What happens if the URL or parameters change for an API?&lt;/li&gt;
&lt;li&gt;What happens if duplicate files are made? And what if updates are made to one file, and not the other?&lt;/li&gt;
&lt;li&gt;How do I best share these files with colleagues?&lt;/li&gt;
&lt;li&gt;What happens if multiple files depend on one another?&lt;/li&gt;
&lt;li&gt;How do I prevent incorrect data from being added to the wrong column of a CSV? (i.e., a decimal where a string should be)&lt;/li&gt;
&lt;li&gt;What about very large files? What is the ingestion rate for a 10 MB, 100 MB, 1 GB, or 1 TB file?&lt;/li&gt;
&lt;/ul&gt;
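To make the wrong-column concern concrete, here is a small sketch of how a stray string in a numeric CSV column surfaces in pandas; the file contents and column names are hypothetical, not from the example dataset.

```python
import io
import pandas as pd

# Hypothetical CSV where a string slipped into a numeric column
csv_data = io.StringIO("reading_id,value_kwh\n1,3.2\n2,oops\n3,4.7\n")

# Without dtype enforcement, pandas silently falls back to an object column
df = pd.read_csv(csv_data)
print(df["value_kwh"].dtype)  # object, not float64

# Coercing flags the bad cell as NaN so it can be caught and handled
df["value_kwh"] = pd.to_numeric(df["value_kwh"], errors="coerce")
print(df["value_kwh"].isna().sum())  # one bad row
```

A database would instead reject the bad value at insert time, which is part of the argument for a single typed source of truth.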

&lt;p&gt;After running into these initial problems project after project, &lt;strong&gt;I knew there had to be a better solution. I knew that I needed a single source of truth for my data – and it started to become clear that a specialized SQL database might be my answer!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s consider if I were to connect to TimescaleDB.&lt;/p&gt;

&lt;p&gt;By importing my time-series data into TimescaleDB, I can create one source of truth for all of my data. As a result, collaborating with others becomes as simple as sharing access to the database. Any modifications to the data munging process within the database mean that all users have access to the same changes at the same time, as opposed to parsing through CSV files to verify that I have the right version.&lt;/p&gt;

&lt;p&gt;Additionally, databases can typically handle much larger data loads than a script written in Python or R. TimescaleDB was built to house, maintain, and query terabytes of data efficiently and cost-effectively (both computationally speaking AND for your wallet). With features like &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=caggs-docs" rel="noopener noreferrer"&gt;continuous aggregates&lt;/a&gt; and native &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=compression-docs" rel="noopener noreferrer"&gt;columnar compression&lt;/a&gt;, storing and analyzing years of time-series data became efficient while still being easily accessible.&lt;/p&gt;

&lt;p&gt;In short, managing data over time, especially when it comes from different sources, can be a nightmare to maintain and access efficiently. But, it doesn’t have to be.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data analysis issue #2: maximizing analysis speed and computation efficiency (the bigger the dataset, the bigger the problem)
&lt;/h1&gt;


&lt;p&gt;Excel, R, and Python are all capable of performing the first three steps of the data analysis “lifecycle”: evaluating, cleaning, and transforming data. However, these technologies are not generally optimized for speed or computational efficiency during the process.&lt;/p&gt;

&lt;p&gt;In numerous projects over the years, I’ve found that as the size of my dataset increased, the process of importing, cleaning, and transforming it became more difficult, time-consuming, and, in some cases, impossible. For Python and R, parsing through large amounts of data seemed to take forever, and Excel would simply crash once it hit millions of rows.  &lt;/p&gt;

&lt;p&gt;Things became &lt;em&gt;especially&lt;/em&gt; difficult when I needed to create additional tables for things like aggregates or data transformations: some lines of code could take seconds or, in extreme cases, minutes to run depending on the size of the data, the computer I was using, or the complexity of the analysis.&lt;/p&gt;

&lt;p&gt;While seconds or minutes may not &lt;em&gt;seem&lt;/em&gt; like a lot, it adds up and amounts to hours or days of lost productivity when you’re performing analysis that needs to be run hundreds or thousands of times a month!&lt;/p&gt;

&lt;p&gt;To illustrate, let’s look at a Python example once again.&lt;/p&gt;

&lt;p&gt;Say I was working with this &lt;a href="https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries" rel="noopener noreferrer"&gt;IoT data set taken from Kaggle&lt;/a&gt;. The set contains two tables, one specifying energy consumption for a single home in Houston Texas, and the other documenting weather conditions.&lt;/p&gt;

&lt;p&gt;To run through analysis with Python, the first steps in my analysis would be to pull in the data and observe it.&lt;/p&gt;

&lt;p&gt;When using Python to do this, I would run code like this 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;


&lt;span class="c1"&gt;## use config file for database connection information
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConfigParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;env.ini&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## establish conntection
&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DB_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                        &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HOST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                        &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USER&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                        &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PASS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                        &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PORT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;## define the queries for selecting data out of our database                        
&lt;/span&gt;&lt;span class="n"&gt;query_weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;select * from weather&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;query_power&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;select * from power_usage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;## create cursor to extract data and place it into a DataFrame
&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_weather&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_power&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;power_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;## you will have to manually set the column names for the data frame
&lt;/span&gt;&lt;span class="n"&gt;weather_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp_max&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp_min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dew_max&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dew_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dew_min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hum_max&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hum_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span 
class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hum_min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wind_max&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wind_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wind_min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;press_max&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;press_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;press_min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precipit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;power_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;startdate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value_kwh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Altogether, this code took 2.718 seconds to run on my &lt;a href="https://www.apple.com/shop/buy-mac/macbook-pro/16-inch-space-gray-2.3ghz-8-core-processor-1tb#" rel="noopener noreferrer"&gt;2019 MacBook Pro with 32GB of memory&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But what if I run the equivalent script as SQL directly in the database?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;startdate&lt;/th&gt;
&lt;th&gt;value_kwh&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 01:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 02:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 03:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 04:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 05:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 06:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 07:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 08:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 09:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 10:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 11:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 12:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 13:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 14:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 15:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 16:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 17:00:00&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This query took only 0.342 seconds to run, nearly 8x faster than the Python script.&lt;/p&gt;

&lt;p&gt;This time difference makes sense when we consider that Python must connect to the database, run the SQL query, parse the retrieved data, and import it into a DataFrame. While almost three seconds is fast, this extra processing time adds up as the script grows more complicated and more data munging tasks are added.&lt;/p&gt;
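&lt;p&gt;To make that overhead concrete, here is a minimal sketch of the last two steps, with a small hypothetical list of tuples standing in for what a psycopg2 cursor would return from the query:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical rows standing in for the tuples that cursor.fetchall()
# returns from the database query
power_data = [
    ("2016-01-06 01:00:00", 1, 2, "weekday"),
    ("2016-01-06 02:00:00", 1, 2, "weekday"),
    ("2016-01-06 05:00:00", 0, 2, "weekday"),
]

# Each step adds overhead on top of the query itself: the driver ships
# rows over the wire, Python materializes them as tuples, and pandas
# copies them again into a DataFrame (and re-parses the timestamps).
power_df = pd.DataFrame(
    power_data, columns=["startdate", "value_kwh", "day_of_week", "notes"]
)
power_df["startdate"] = pd.to_datetime(power_df["startdate"])

print(power_df.shape)  # (3, 4)
```

&lt;p&gt;None of this work exists when the query runs in the database: the rows are already typed and indexed where they live.&lt;/p&gt;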

&lt;p&gt;Pulling in the data and observing it is only the beginning of my analysis! What happens when I need to perform a transforming task, like aggregating the data?&lt;/p&gt;

&lt;p&gt;For this dataset, when we look at the &lt;code&gt;power_usage&lt;/code&gt; table - as seen above - kWh readings are recorded every hour. If I want to do daily analysis, I have to aggregate the hourly data into “day buckets”.  &lt;/p&gt;

&lt;p&gt;If I used Python for this aggregation, I could use something like 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# sum power usage by day, bucket by day
## create column for the day 
&lt;/span&gt;&lt;span class="n"&gt;day_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;startdate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agg_power&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value_kwh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sum&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unique&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unique&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...which takes 0.49 seconds to run (this does not include the time for importing our data).&lt;/p&gt;
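&lt;p&gt;If you want to reproduce timings like these yourself, a minimal sketch with &lt;code&gt;time.perf_counter()&lt;/code&gt; over synthetic hourly data (the exact numbers will vary by machine and data size):&lt;/p&gt;

```python
import time

import pandas as pd

# Synthetic stand-in for power_df: 1,000 hourly kWh readings
power_df = pd.DataFrame({
    "startdate": pd.date_range("2016-01-06", periods=1000, freq="h"),
    "value_kwh": range(1000),
})

# Time only the day-bucketing aggregation, not the data import
start = time.perf_counter()
day_col = power_df["startdate"].dt.strftime("%Y-%m-%d")
agg_power = (
    power_df.assign(date_day=day_col)
    .groupby("date_day")
    .agg({"value_kwh": "sum"})
)
elapsed = time.perf_counter() - start

print(len(agg_power))  # 42 day buckets
```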

&lt;p&gt;Alternatively, with the TimescaleDB &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=time-bucket-docs" rel="noopener noreferrer"&gt;&lt;code&gt;time_bucket()&lt;/code&gt;&lt;/a&gt; function, I could do this aggregation directly in the database using the following query 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startdate&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day&lt;/th&gt;
&lt;th&gt;sum&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 00:00:00&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-07 00:00:00&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-08 00:00:00&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-09 00:00:00&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-10 00:00:00&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-11 00:00:00&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-12 00:00:00&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-06 00:00:00&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-07 00:00:00&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-08 00:00:00&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-09 00:00:00&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-10 00:00:00&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;...which only takes 0.087 seconds and is over 5x faster than the Python script.&lt;/p&gt;

&lt;p&gt;You can start to see a pattern here.&lt;/p&gt;

&lt;p&gt;As mentioned above, &lt;a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=why-use-timescale-docs#why-use-timescaledb/" rel="noopener noreferrer"&gt;TimescaleDB was created to efficiently query and store time-series data&lt;/a&gt;. But simply querying data only scratches the surface of the possibilities TimescaleDB and PostgreSQL functionality provides.&lt;/p&gt;

&lt;p&gt;TimescaleDB and PostgreSQL offer a wide range of tools and functionality that can replace the need for additional tools to evaluate, clean, and transform your data. Some of the TimescaleDB functionality includes continuous aggregates, compression, and &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=hyperfunctions-doc" rel="noopener noreferrer"&gt;hyperfunctions&lt;/a&gt;, all of which allow you to do nearly all data munging tasks directly within the database.&lt;/p&gt;

&lt;p&gt;When I performed the evaluating, cleaning, and transforming steps of my analysis directly within TimescaleDB, I cut out the need to use additional tools - like Excel, R, or Python - for data munging tasks. I could pull cleaned and transformed data, ready for modeling, directly into Excel, R, or Python.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data analysis issue #3: storing and maintaining scripts for data analysis
&lt;/h1&gt;


&lt;p&gt;Another potential downside of exclusively using Excel, R, or Python for the entire data analysis workflow is that all of the logic for analyzing the data is contained within a script file. As with having many different data sources, maintaining script files can be inconvenient and messy.&lt;/p&gt;

&lt;p&gt;Some common issues that I - and many data analysts - run into include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Losing files&lt;/li&gt;
&lt;li&gt;Unintentionally creating duplicate files&lt;/li&gt;
&lt;li&gt;Changing or updating some files but not others&lt;/li&gt;
&lt;li&gt;Needing to write and run scripts to access transformed data (see the example below)&lt;/li&gt;
&lt;li&gt;Spending time re-running scripts whenever new raw data is added (see the example below)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While you can use a code repository to overcome some of these issues, it will not fix the last two.  &lt;/p&gt;

&lt;p&gt;Let’s consider our Python scenario again.&lt;/p&gt;

&lt;p&gt;Say that I used a Python script exclusively for all my data analysis tasks. What happens if I need to export my transformed data to use in a report on energy consumption in Texas?&lt;/p&gt;

&lt;p&gt;Likely, I would have to add some code within the script to allow for exporting the data and then run the script again to actually export it. Depending on the content of the script and how long it takes to transform the data, this could be pretty inconvenient and inefficient.&lt;/p&gt;

&lt;p&gt;What if I also just got a bunch of new energy usage and weather data? For me to incorporate this new raw data into existing visualizations or reports, I would need to run the script again and make sure that all of my data munging tasks run as expected.&lt;/p&gt;

&lt;p&gt;Database functions, like continuous aggregates and materialized views, can create transformed data that can be stored and queried directly from your database without running a script. Additionally, I can create policies for continuous aggregates to regularly keep this transformed data up-to-date any time raw data is modified. Because of these policies, I wouldn't have to worry about running scripts to re-transform data for use, making access to updated data efficient. With TimescaleDB, many of the data munging tasks in the analysis lifecycle that you would normally do within your scripts can be accomplished using built-in TimescaleDB and PostgreSQL functionality.&lt;/p&gt;
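&lt;p&gt;For illustration, a continuous aggregate over the &lt;code&gt;power_usage&lt;/code&gt; table from earlier might look like the following sketch (the intervals and policy settings here are assumptions for demonstration, not recommendations):&lt;/p&gt;

```sql
-- A continuous aggregate that stores daily kWh totals for power_usage
CREATE MATERIALIZED VIEW power_daily
WITH (timescaledb.continuous) AS
SELECT
    time_bucket(INTERVAL '1 day', startdate) AS day,
    sum(value_kwh) AS total_kwh
FROM power_usage
GROUP BY time_bucket(INTERVAL '1 day', startdate);

-- A policy that keeps the recent buckets refreshed as raw data changes
SELECT add_continuous_aggregate_policy('power_daily',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```

&lt;p&gt;Once this is in place, querying &lt;code&gt;power_daily&lt;/code&gt; returns pre-aggregated results without re-running any script.&lt;/p&gt;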

&lt;h1&gt;
  
  
  Data analysis issue #4: easily utilizing new or additional technologies
&lt;/h1&gt;


&lt;p&gt;Finally, the last step in the data analysis lifecycle: modeling. Whenever I wanted to use a new tool or technology to create a visualization, it was difficult to take my transformed data and use it for modeling or visualizations elsewhere.&lt;/p&gt;

&lt;p&gt;Python, R, and Excel are all pretty great for their visualization and modeling capabilities. However, what happens when your company or team wants to adopt a new tool?&lt;/p&gt;

&lt;p&gt;In my experience, this often means either adding on another step to the analysis process, or rediscovering how to perform the evaluating, cleaning, and transforming steps within the new technology.&lt;/p&gt;

&lt;p&gt;For example, in one of my previous jobs, I was asked to convert a portion of my analysis into Power BI for business analytics purposes. Some of the visualizations my stakeholders wanted required access to transformed data from my Python script. At the time, I could either export the data from my Python script or figure out how to transform the data directly in Power BI. Neither option was ideal, and both were guaranteed to take extra time.&lt;/p&gt;

&lt;p&gt;When it comes to adopting new visualization or modeling tools, using a database for evaluating, cleaning, and transforming data can again work in your favor. Most visualization tools - such as &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;, &lt;a href="https://www.metabase.com/" rel="noopener noreferrer"&gt;Metabase&lt;/a&gt;, or &lt;a href="https://powerbi.microsoft.com/" rel="noopener noreferrer"&gt;Power BI&lt;/a&gt; - allow users to import data from a database directly.&lt;/p&gt;

&lt;p&gt;Since I can do most of my data munging tasks within TimescaleDB, adding or switching tools - such as using Power BI for dashboard capabilities - becomes as simple as connecting to my database, pulling in the munged data, and using the new tool for visualizations and modeling.&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrapping up
&lt;/h1&gt;


&lt;p&gt;In summary, Excel, R, and Python are all great tools to use for analysis, but may not be the best tools for every job. Case in point: my struggles with time-series data analysis, especially on big datasets.&lt;/p&gt;

&lt;p&gt;With TimescaleDB functionality, you can house your data and perform the evaluating, cleaning, and transforming aspects of data analysis, all directly within your database – and solve a lot of common data analysis woes in the process (which I’ve - hopefully! - demonstrated in this post).&lt;/p&gt;

&lt;p&gt;In the blog posts to come, I’ll explore TimescaleDB and PostgreSQL functionality compared to Python, benchmark TimescaleDB performance vs. Python and pandas for data munging tasks, and conduct a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).&lt;/p&gt;

&lt;p&gt;If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, &lt;a href="https://slack.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=slack" rel="noopener noreferrer"&gt;join our &lt;strong&gt;community Slack&lt;/strong&gt;&lt;/a&gt;, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).&lt;/p&gt;

&lt;p&gt;If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can sign up for &lt;a href="https://www.timescale.com/timescale-signup/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-signup" rel="noopener noreferrer"&gt;a free 30-day trial&lt;/a&gt; or &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-self-hosted-guide" rel="noopener noreferrer"&gt;install TimescaleDB&lt;/a&gt; and manage it on your current PostgreSQL instances. We also have a bunch of great &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-tutorials" rel="noopener noreferrer"&gt;tutorials&lt;/a&gt; to help get you started.&lt;/p&gt;

&lt;p&gt;Until next time!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was originally written by Miranda Auhl and published on the &lt;a href="https://blog.timescale.com/blog/speeding-up-data-analysis/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=original-blog-post" rel="noopener noreferrer"&gt;Timescale blog&lt;/a&gt; on September 9, 2021.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>postgres</category>
      <category>timeseries</category>
    </item>
    <item>
      <title>Hacking NFL data with PostgreSQL, TimescaleDB, and SQL</title>
      <dc:creator>mirandaauhl</dc:creator>
      <pubDate>Fri, 30 Jul 2021 18:55:02 +0000</pubDate>
      <link>https://dev.to/tigerdata/hacking-nfl-data-with-postgresql-timescaledb-and-sql-5e24</link>
      <guid>https://dev.to/tigerdata/hacking-nfl-data-with-postgresql-timescaledb-and-sql-5e24</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The NFL dataset&lt;/li&gt;
&lt;li&gt;Accessing the data&lt;/li&gt;
&lt;li&gt;Let's start exploring!&lt;/li&gt;
&lt;li&gt;The power of SQL&lt;/li&gt;
&lt;li&gt;Faster insights with PostgreSQL and TimescaleDB&lt;/li&gt;
&lt;li&gt;Faster queries with TimescaleDB continuous aggregates&lt;/li&gt;
&lt;li&gt;Advanced SQL data analysis with TimescaleDB hyperfunctions&lt;/li&gt;
&lt;li&gt;Where can the data take you?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Learn how to use time-series data provided by the NFL to uncover valuable insights into many player performance metrics – and ways to apply the same methods to improve your fantasy league team, your knowledge of the game, or your viewing experience - all with PostgreSQL, standard SQL, and freely available extensions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Time-series data is everywhere, including, much to our surprise, the world of professional sports. At &lt;a href="https://www.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=timescale-site" rel="noopener noreferrer"&gt;Timescale&lt;/a&gt;, we're always looking for fun ways to showcase the expanding reach of time-series data. &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/analyze-intraday-stocks/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=stock-tutorial" rel="noopener noreferrer"&gt;Stock&lt;/a&gt;, &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/analyze-cryptocurrency-data/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=crypto-tutorial" rel="noopener noreferrer"&gt;cryptocurrency&lt;/a&gt;, &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/nyc-taxi-cab/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=iot-tutorial" rel="noopener noreferrer"&gt;IoT&lt;/a&gt;, and &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/promscale/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=promscale-tutorial" rel="noopener noreferrer"&gt;infrastructure metrics&lt;/a&gt; data are relatively common and widely understood time-series data scenarios. Head to Twitter on any given day, search for &lt;a href="https://twitter.com/hashtag/TimeSeries" rel="noopener noreferrer"&gt;#timeseries&lt;/a&gt; or &lt;a href="https://twitter.com/hashtag/TimescaleDB" rel="noopener noreferrer"&gt;#TimescaleDB&lt;/a&gt;, and you're sure to find questions about high-frequency trading or massive scale observability data with tools like Prometheus.&lt;/p&gt;

&lt;p&gt;You can imagine our excitement, then, when we happened upon the &lt;a href="https://operations.nfl.com/gameday/analytics/big-data-bowl/" rel="noopener noreferrer"&gt;NFL Big Data Bowl&lt;/a&gt;, an annual competition that encourages the data science community to use historical player position and play data to create machine learning models.&lt;/p&gt;

&lt;p&gt;Did the NFL &lt;strong&gt;&lt;em&gt;really&lt;/em&gt;&lt;/strong&gt; give access to 18+ million rows of detailed play data from every regular season NFL game?&lt;/p&gt;

&lt;p&gt;For background, the National Football League (NFL) is the US professional sports league for American football, and the NFL season is followed by tens of millions of people, culminating in the annual Super Bowl (which attracts 100M+ global viewers, whether for the game or for the commercials).&lt;/p&gt;

&lt;p&gt;Each NFL game takes place as a series of “plays,” in which the two teams try to score and prevent the other team from scoring. There are approximately 200 plays per game, with up to 15 games a week during the regular season. A healthy amount of data, but nothing unmanageable.&lt;/p&gt;

&lt;p&gt;So, at first glance, football game metrics might not immediately jump out as anything special.&lt;/p&gt;

&lt;p&gt;But then the NFL did something pretty ambitious and amazing.&lt;/p&gt;

&lt;p&gt;All &lt;a href="https://operations.nfl.com/gameday/technology/nfl-next-gen-stats/" rel="noopener noreferrer"&gt;NFL players are equipped with RFID chips&lt;/a&gt; that track players’ position, speed, and various other metrics, which teams use to identify trends, mitigate risks, and continuously optimize. The NFL started tracking and storing data for every player on the field, for every play, for every game.&lt;/p&gt;

&lt;p&gt;As a result, we now have access to a very detailed analysis of exactly how a play unfolded, how quickly various players accelerated during each play, and the play’s outcome. A traditional view of play-by-play metrics is “down and distance” and the result of the play (yards gained, whether or not there was a score, and so on). With the NFL’s dataset, we're able to mine approximately 100 data points at 100-millisecond intervals throughout the play to see speed, distance, involved players, and much more.&lt;/p&gt;

&lt;p&gt;This isn’t ordinary data. &lt;a href="https://blog.timescale.com/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=time-series-blog" rel="noopener noreferrer"&gt;This is time-series data&lt;/a&gt;. Time-series data is a sequence of data points collected over time intervals, giving us the ability to track changes over time. In the case of the NFL’s dataset, we have time-series data that represents how a play changes, including the locations of the players on the field, the location of the ball, the relative acceleration of players in the field of play, and so much more.&lt;/p&gt;

&lt;p&gt;Time-series data comes at you fast, sometimes generating millions of data points per second (&lt;a href="https://blog.timescale.com/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=time-series-blog" rel="noopener noreferrer"&gt;read more about time-series data&lt;/a&gt;). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built &lt;a href="https://blog.timescale.com/blog/timescaledb-2-0-a-multi-node-petabyte-scale-completely-free-relational-database-for-time-series/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=multi-node-blog" rel="noopener noreferrer"&gt;TimescaleDB, a multi-node, petabyte-scale, completely free relational database for time-series&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We couldn't pass up the opportunity to look at the NFL dataset with TimescaleDB, exploring ways we could peer deeper into player performance in hopes of providing insights about overall player performance in the coming season.&lt;/p&gt;

&lt;p&gt;Read on for more information about the &lt;a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview" rel="noopener noreferrer"&gt;NFL’s dataset&lt;/a&gt; and how you can start using it, plus some sample queries to jumpstart your analysis. They may help you get more enjoyment out of the game.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you’d like to get started with NFL data, you can spin up a fully managed TimescaleDB service:&lt;/strong&gt; create an account to try it for free for 30 days. The instructions later in this post will take you through how to ingest the data and start using it for analysis.&lt;/p&gt;

&lt;p&gt;If you’re new to time-series data or just have some questions you’d like to ask about the dataset, &lt;a href="https://slack.timescale.com/" rel="noopener noreferrer"&gt;join our public Slack community&lt;/a&gt;, where you’ll find Timescale team members and thousands of time-series enthusiasts, and we’ll be happy to help you.&lt;/p&gt;




&lt;h2&gt;
  
  
  The NFL dataset
&lt;/h2&gt;

&lt;p&gt;Over the last few years, the NFL and Kaggle have collaborated on the &lt;a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview" rel="noopener noreferrer"&gt;NFL Big Data Bowl&lt;/a&gt;. The goal is to use historical data to answer a predetermined genre of questions, typically producing a machine learning model that can help predict the outcome of certain plays during regular season games.&lt;/p&gt;

&lt;p&gt;Although the 2020/2021 contest is over, the sample dataset they provided from a prior season is still available for download and analysis. The 2020/2021 competition focused on pass play defense efficiency; therefore, only the tracking data for offensive and defensive "playmakers" is available in the dataset. No offensive or defensive linemen data is included. (You can read more about &lt;a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/discussion/217170" rel="noopener noreferrer"&gt;last year’s winners&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;(Keep watching the &lt;a href="https://operations.nfl.com/gameday/analytics/big-data-bowl/" rel="noopener noreferrer"&gt;NFL website&lt;/a&gt; for more information on the next Big Data Bowl.)&lt;/p&gt;




&lt;h2&gt;
  
  
  Accessing the data
&lt;/h2&gt;

&lt;p&gt;For the purposes of this blog post and accompanying tutorial, we will use the &lt;a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview" rel="noopener noreferrer"&gt;sample data provided by the NFL&lt;/a&gt;. This data is from the 2018 NFL season and is available as CSV files, including game-specific data and week-by-week tracking data for each player involved in the "offensive" part of the pass play. Participants in the next season of the contest will have access to new weekly game data.&lt;/p&gt;

&lt;p&gt;This data is also very relational in nature, which means that SQL is a great medium to start gleaning value – without the need for Jupyter notebooks, data-science-specific languages (like Python or R), or additional toolsets.&lt;/p&gt;

&lt;p&gt;If you want to follow along with - or recreate! - the queries we go through below, &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=nfl-tutorial" rel="noopener noreferrer"&gt;follow our tutorial&lt;/a&gt; to set up the tables, ingest data, and start analyzing data in TimescaleDB. For those unfamiliar with TimescaleDB, it’s built on PostgreSQL, so you’ll find that all of our queries are standard SQL. If you know SQL, you’ll know how to do everything here. (Some of the more advanced query examples we provide require our new, advanced hyperfunctions, which come pre-installed with any &lt;a href="https://console.forge.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=forge" rel="noopener noreferrer"&gt;Timescale Forge instance&lt;/a&gt;.)&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's start exploring!
&lt;/h2&gt;

&lt;p&gt;We've provided the steps needed to ingest the dataset into TimescaleDB in the &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=nfl-tutorial" rel="noopener noreferrer"&gt;accompanying tutorial&lt;/a&gt;, so we won’t go into that here.&lt;/p&gt;

&lt;p&gt;The NFL dataset includes the following data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Games:&lt;/strong&gt; all relevant data about each game of the regular season, including date, teams, time, and location&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Players:&lt;/strong&gt; information on each player, including what team they play for and their originating college&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plays:&lt;/strong&gt; a wealth of data about each pass play in the game. Helpful fields include the down, a description of the play, the line of scrimmage, and total offensive yardage, among other details.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Week [1-17]:&lt;/strong&gt; for each week of the season, the NFL provides a new CSV file with the tracking data of every player, for every play (pass plays for this data). Interesting fields include X/Y position data (relative to the football field) every few hundred milliseconds throughout each play, player acceleration, and the "type" of route taken. (In our tutorial, this data is imported into the &lt;code&gt;tracking&lt;/code&gt; table and totals almost 20 million rows of time-series data.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
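
&lt;p&gt;Once ingest is complete, a quick sanity check is to verify the row counts of each table. (This is a minimal sketch; the table names below follow the tutorial's schema, with the weekly files loaded into &lt;code&gt;tracking&lt;/code&gt;.)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Row counts per table; tracking should hold almost 20 million rows
SELECT 'game' AS table_name, count(*) AS row_count FROM game
UNION ALL SELECT 'player', count(*) FROM player
UNION ALL SELECT 'play', count(*) FROM play
UNION ALL SELECT 'tracking', count(*) FROM tracking;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;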

&lt;p&gt;In addition to the NFL dataset, we also provide some extra data from Wikipedia that includes game scores and stadium conditions for each game, which you can load as part of the tutorial. With other time-series databases, it can be difficult to combine your time-series data with any other data you may have on hand (see our &lt;a href="https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=influx-compare-blog" rel="noopener noreferrer"&gt;TimescaleDB vs. InfluxDB comparison&lt;/a&gt; for reference).&lt;/p&gt;

&lt;p&gt;Because TimescaleDB is PostgreSQL with time-series superpowers, it supports JOINs, so any extra relational data you want to add for deeper analysis is just a SQL query away. In our case, we can combine the NFL’s play-by-play data with weather data for each stadium.&lt;/p&gt;
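
&lt;p&gt;As a sketch of what such a join could look like (the &lt;code&gt;weather&lt;/code&gt; table and its columns here are hypothetical; adapt the names to however you load the extra stadium data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Join play-by-play data to per-game weather data;
-- the weather table name and its columns are assumptions, not part of the dataset
SELECT p.gameid, p.playid, w.temperature, w.conditions
FROM play p
INNER JOIN game g ON p.gameid = g.game_id
INNER JOIN weather w ON g.game_id = w.game_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;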

&lt;p&gt;Once you have the data ready, the world of NFL playmakers is at your fingertips, so let’s get started!&lt;/p&gt;




&lt;h2&gt;
  
  
  The power of SQL
&lt;/h2&gt;

&lt;p&gt;Year after year, we see SQL listed as one of the most popular languages among developers in the &lt;a href="https://insights.stackoverflow.com/survey/2020#technology-programming-scripting-and-markup-languages-all-respondents" rel="noopener noreferrer"&gt;Stack Overflow survey&lt;/a&gt;. Sometimes, however, we can be lured into thinking that the only way to gain insights from relational data is to query it with powerful data analytics tools and languages, create data frames, and use specialized regression algorithms before we can do anything productive.&lt;/p&gt;

&lt;p&gt;It often feels like SQL is only useful for getting and storing data in applications, and that we need to leave the "heavy lifting" of analysis to more mature tools.&lt;/p&gt;

&lt;p&gt;Not so! SQL can data munge with the best of them! Let's look at a first, quick example.&lt;/p&gt;

&lt;h3&gt;
  
  
  Average yards per position, per game
&lt;/h3&gt;

&lt;p&gt;For this first example, we'll query the &lt;code&gt;tracking&lt;/code&gt; table (the player movement data from all 17 weeks of games) and join to the &lt;code&gt;game&lt;/code&gt; table to determine the number of yards per player position, per game.&lt;/p&gt;

&lt;p&gt;The results give you a quick overview of how many yards players at different positions ran throughout each game. You could later use this baseline to see how specific players compare - more or fewer yards - against the totals for their position.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;total_position_yards&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;position_yards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tracking&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;position_yards&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;game_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;total_position_yards&lt;/span&gt; &lt;span class="n"&gt;tpy&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;game_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tpy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'QB'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'RB'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'WR'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'TE'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;game_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Number of plays by offensive player
&lt;/h3&gt;

&lt;p&gt;As a season progresses and players get injured (or traded), it's helpful to know which of the available players have more playing experience, rather than those who have been sitting on the sideline for most of the season. Players with more playing time are often better able to contribute to the outcome of the game.&lt;/p&gt;

&lt;p&gt;This query finds all players that were on the offense for any play and counts how many total passing plays they have been a part of, ordered by total passing plays descending.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;snap_events&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;-- Create a table that filters the play events to show only snap plays&lt;/span&gt;
&lt;span class="c1"&gt;-- and display the players team information&lt;/span&gt;
 &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;CASE&lt;/span&gt;
     &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'away'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visitor_team&lt;/span&gt;
     &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'home'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;home_team&lt;/span&gt;
     &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
     &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;team_name&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tracking&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
 &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;game_id&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'snap_direct'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'ball_snap'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- Count these events &amp;amp; filter results to only display data when the player was&lt;/span&gt;
&lt;span class="c1"&gt;-- on the offensive&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;play_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;snap_events&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;play&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;player&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;possessionteam&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team_name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;play_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;player_id&lt;/th&gt;
&lt;th&gt;display_name&lt;/th&gt;
&lt;th&gt;play_count&lt;/th&gt;
&lt;th&gt;team_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2506109&lt;/td&gt;
&lt;td&gt;Ben Roethlisberger&lt;/td&gt;
&lt;td&gt;725&lt;/td&gt;
&lt;td&gt;PIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2558149&lt;/td&gt;
&lt;td&gt;JuJu Smith-Schuster&lt;/td&gt;
&lt;td&gt;691&lt;/td&gt;
&lt;td&gt;PIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2533031&lt;/td&gt;
&lt;td&gt;Andrew Luck&lt;/td&gt;
&lt;td&gt;683&lt;/td&gt;
&lt;td&gt;IND&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2508061&lt;/td&gt;
&lt;td&gt;Antonio Brown&lt;/td&gt;
&lt;td&gt;679&lt;/td&gt;
&lt;td&gt;PIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;310&lt;/td&gt;
&lt;td&gt;Matt Ryan&lt;/td&gt;
&lt;td&gt;659&lt;/td&gt;
&lt;td&gt;ATL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2506363&lt;/td&gt;
&lt;td&gt;Aaron Rodgers&lt;/td&gt;
&lt;td&gt;656&lt;/td&gt;
&lt;td&gt;GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2505996&lt;/td&gt;
&lt;td&gt;Eli Manning&lt;/td&gt;
&lt;td&gt;639&lt;/td&gt;
&lt;td&gt;NYG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2543495&lt;/td&gt;
&lt;td&gt;Davante Adams&lt;/td&gt;
&lt;td&gt;630&lt;/td&gt;
&lt;td&gt;GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2540158&lt;/td&gt;
&lt;td&gt;Zach Ertz&lt;/td&gt;
&lt;td&gt;629&lt;/td&gt;
&lt;td&gt;PHI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2532820&lt;/td&gt;
&lt;td&gt;Kirk Cousins&lt;/td&gt;
&lt;td&gt;621&lt;/td&gt;
&lt;td&gt;MIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;79860&lt;/td&gt;
&lt;td&gt;Matthew Stafford&lt;/td&gt;
&lt;td&gt;619&lt;/td&gt;
&lt;td&gt;DET&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2504211&lt;/td&gt;
&lt;td&gt;Tom Brady&lt;/td&gt;
&lt;td&gt;613&lt;/td&gt;
&lt;td&gt;NE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you’re familiar with American football, you might know that players are substituted in and out of the game based on game conditions. Stronger, larger players may play in some situations, while faster, more agile players may play in others.&lt;/p&gt;

&lt;p&gt;Quarterbacks are the most “important” players on the field and tend to play more than others. By omitting quarterbacks, however, we can get deeper insight into players across all other positions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;snap_events&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;-- Create a table that filters the play events to show only snap plays&lt;/span&gt;
&lt;span class="c1"&gt;-- and display the players team information&lt;/span&gt;
 &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;CASE&lt;/span&gt;
     &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'away'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visitor_team&lt;/span&gt;
     &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'home'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;home_team&lt;/span&gt;
     &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
     &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;team_name&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tracking&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
 &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;game_id&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'snap_direct'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'ball_snap'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- Count these events &amp;amp; filter results to only display data when the player was&lt;/span&gt;
&lt;span class="c1"&gt;-- on the offensive&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;play_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"position"&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;snap_events&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;play&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;player&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;possessionteam&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"position"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'QB'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"position"&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;play_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, now we can see the non-quarterbacks who are on offense the most in a season:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;player_id&lt;/th&gt;
&lt;th&gt;display_name&lt;/th&gt;
&lt;th&gt;play_count&lt;/th&gt;
&lt;th&gt;team_name&lt;/th&gt;
&lt;th&gt;position&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2558149&lt;/td&gt;
&lt;td&gt;JuJu Smith-Schuster&lt;/td&gt;
&lt;td&gt;691&lt;/td&gt;
&lt;td&gt;PIT&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2508061&lt;/td&gt;
&lt;td&gt;Antonio Brown&lt;/td&gt;
&lt;td&gt;679&lt;/td&gt;
&lt;td&gt;PIT&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2543495&lt;/td&gt;
&lt;td&gt;Davante Adams&lt;/td&gt;
&lt;td&gt;630&lt;/td&gt;
&lt;td&gt;GB&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2540158&lt;/td&gt;
&lt;td&gt;Zach Ertz&lt;/td&gt;
&lt;td&gt;629&lt;/td&gt;
&lt;td&gt;PHI&lt;/td&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2541785&lt;/td&gt;
&lt;td&gt;Adam Thielen&lt;/td&gt;
&lt;td&gt;612&lt;/td&gt;
&lt;td&gt;MIN&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2543468&lt;/td&gt;
&lt;td&gt;Mike Evans&lt;/td&gt;
&lt;td&gt;610&lt;/td&gt;
&lt;td&gt;TB&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2555295&lt;/td&gt;
&lt;td&gt;Sterling Shepard&lt;/td&gt;
&lt;td&gt;610&lt;/td&gt;
&lt;td&gt;NYG&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2540169&lt;/td&gt;
&lt;td&gt;Robert Woods&lt;/td&gt;
&lt;td&gt;604&lt;/td&gt;
&lt;td&gt;LA&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2552600&lt;/td&gt;
&lt;td&gt;Nelson Agholor&lt;/td&gt;
&lt;td&gt;604&lt;/td&gt;
&lt;td&gt;PHI&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2543488&lt;/td&gt;
&lt;td&gt;Jarvis Landry&lt;/td&gt;
&lt;td&gt;592&lt;/td&gt;
&lt;td&gt;CLE&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2540165&lt;/td&gt;
&lt;td&gt;DeAndre Hopkins&lt;/td&gt;
&lt;td&gt;587&lt;/td&gt;
&lt;td&gt;HOU&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2543498&lt;/td&gt;
&lt;td&gt;Brandin Cooks&lt;/td&gt;
&lt;td&gt;581&lt;/td&gt;
&lt;td&gt;LA&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Sack percentage by quarterback on passing plays
&lt;/h3&gt;

&lt;p&gt;We can start to go a little deeper by extracting specific data from the &lt;code&gt;tracking&lt;/code&gt; table and layering queries on top of it to make correlations. One piece of information that might be helpful in your analysis is knowing which quarterbacks are sacked most often during passing plays. In football, a “sack” is a negative play for the offense, and quarterbacks who get sacked more often tend to be lower performers overall.&lt;/p&gt;

&lt;p&gt;Once you know those players, you could expand your analysis to see if they are sacked more on specific types of plays (e.g., plays from the shotgun formation), or whether sacks occur more often in a specific quarter of the game (perhaps the fourth quarter, because the offensive line is more tired, or because the team tends to be behind late in games and must pass more often).&lt;/p&gt;

&lt;p&gt;Queries like this can quickly show you quarterbacks who are more likely to get sacked, particularly when they play a strong defensive team. To get started, we wanted to find the sack percentage of each quarterback based on the total number of pass plays they were involved in during the regular season. To do that, we layered Common Table Expressions (CTEs) on the tracking data so that each query could build upon previous results.&lt;/p&gt;

&lt;p&gt;First, we select the distinct list of all plays for each quarterback (&lt;code&gt;qb_plays&lt;/code&gt;). We use &lt;code&gt;SELECT DISTINCT&lt;/code&gt; because the tracking table holds multiple entries for each player on each play; we need just one row per play, for each quarterback.&lt;/p&gt;

&lt;p&gt;With this result, we can then count the number of total plays per quarterback (&lt;code&gt;total_qb_plays&lt;/code&gt;), the total number of games each quarterback played (&lt;code&gt;qb_games&lt;/code&gt;), and finally the number of pass plays the quarterback was a part of that resulted in a sack (&lt;code&gt;sacks&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;With that data in hand, we can finally query all of the values, do a percentage calculation, and order it by the total sack count.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;qb_plays&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;playid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;playid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tracking&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'QB'&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;total_qb_plays&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;play_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;qb_plays&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;qb_games&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;game_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;qb_plays&lt;/span&gt; 
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;sacks&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sack_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;play&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
    &lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;qb_plays&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qb_plays&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qb_plays&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passresult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'S'&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;play_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;game_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sack_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sack_count&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;play_count&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;sack_percentage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;total_qb_plays&lt;/span&gt; &lt;span class="n"&gt;tqp&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;qb_games&lt;/span&gt; &lt;span class="n"&gt;qg&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;tqp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sacks&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;player&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;tqp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sack_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="n"&gt;NULLS&lt;/span&gt; &lt;span class="k"&gt;last&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're an ardent football fan, the results from 2018 probably don't surprise you.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;play_count&lt;/th&gt;
&lt;th&gt;game_count&lt;/th&gt;
&lt;th&gt;sack_count&lt;/th&gt;
&lt;th&gt;sack_percentage&lt;/th&gt;
&lt;th&gt;display_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;579&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;11.23&lt;/td&gt;
&lt;td&gt;Deshaun Watson&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;602&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;55&lt;/td&gt;
&lt;td&gt;9.14&lt;/td&gt;
&lt;td&gt;Dak Prescott&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;611&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;8.67&lt;/td&gt;
&lt;td&gt;Derek Carr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;656&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;7.47&lt;/td&gt;
&lt;td&gt;Aaron Rodgers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;462&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;10.39&lt;/td&gt;
&lt;td&gt;Russell Wilson&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;639&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;7.36&lt;/td&gt;
&lt;td&gt;Eli Manning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;10.04&lt;/td&gt;
&lt;td&gt;Josh Rosen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;659&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;6.53&lt;/td&gt;
&lt;td&gt;Matt Ryan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;386&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;11.14&lt;/td&gt;
&lt;td&gt;Marcus Mariota&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;619&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;td&gt;6.62&lt;/td&gt;
&lt;td&gt;Matthew Stafford&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;621&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;6.12&lt;/td&gt;
&lt;td&gt;Kirk Cousins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;324&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;11.42&lt;/td&gt;
&lt;td&gt;Ryan Tannehill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;447&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;8.05&lt;/td&gt;
&lt;td&gt;Carson Wentz&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Of course, there are a few quarterbacks who always seem to find a way to avoid a sack.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;play_count&lt;/th&gt;
&lt;th&gt;game_count&lt;/th&gt;
&lt;th&gt;sack_count&lt;/th&gt;
&lt;th&gt;sack_percentage&lt;/th&gt;
&lt;th&gt;display_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;725&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;3.45&lt;/td&gt;
&lt;td&gt;Ben Roethlisberger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;682&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;3.23&lt;/td&gt;
&lt;td&gt;Andrew Luck&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;613&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;3.43&lt;/td&gt;
&lt;td&gt;Tom Brady&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
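&lt;p&gt;The &lt;code&gt;sack_percentage&lt;/code&gt; column in both tables is just the ratio computed in the final &lt;code&gt;SELECT&lt;/code&gt;: &lt;code&gt;sack_count / play_count * 100&lt;/code&gt;. As a quick sanity check, a few lines of Python reproduce the values from the rows above:&lt;/p&gt;

```python
# Reproduce the sack_percentage column from the tables above:
# sack_percentage = sack_count / play_count * 100
rows = [
    ("Deshaun Watson", 579, 65),
    ("Russell Wilson", 462, 48),
    ("Tom Brady", 613, 21),
]

for name, play_count, sack_count in rows:
    sack_percentage = sack_count / play_count * 100
    print(f"{name}: {sack_percentage:.2f}%")
    # Deshaun Watson: 11.23%, Russell Wilson: 10.39%, Tom Brady: 3.43%
```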

&lt;p&gt;Now, let’s try some more “advanced” queries and analyses.&lt;/p&gt;




&lt;h2&gt;
  
  
  Faster insights with PostgreSQL and TimescaleDB
&lt;/h2&gt;

&lt;p&gt;So far, the queries we've shown are interesting and provide insights into various players throughout the season – but if you were looking closely, they're all regular SQL statements.&lt;/p&gt;

&lt;p&gt;A season of NFL tracking data isn't like typical time-series data, however: most of the queries we want to perform need to examine all 20 million rows in some way.&lt;/p&gt;

&lt;p&gt;This is where a tool that's been built for time-series analysis, even when the data isn't typical time-series data, can significantly improve your ability to examine the data and save money at the same time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Faster queries with TimescaleDB continuous aggregates
&lt;/h2&gt;

&lt;p&gt;We noticed that we often needed to build queries that started with the &lt;code&gt;tracking&lt;/code&gt; table, filtering data by specific players, positions, and games. Part of the reason is that the &lt;code&gt;play&lt;/code&gt; table doesn't list all of the players who were involved in a particular play. As a result, we need to cross-reference the &lt;code&gt;tracking&lt;/code&gt; table to identify the players who were involved in any given play.&lt;/p&gt;

&lt;p&gt;The first query we demonstrated - “average yards per position, per game” - is a good example of this. It begins by summing all yards, by position, for each game.&lt;/p&gt;

&lt;p&gt;This means that every row in &lt;code&gt;tracking&lt;/code&gt; has to be read and aggregated before we can do any other analysis. Scanning those 20 million rows is pretty boring, repetitive, and slow work – especially compared to the analysis we want to do!&lt;/p&gt;

&lt;p&gt;On our small test instance, the "average yards" query takes about 8 seconds to run. We could increase the size of the instance (which will cost us more money), or we could be smarter about how we query the data (which will cost us more time).&lt;/p&gt;

&lt;p&gt;Instead, we can use continuous aggregates to pre-aggregate the data we're querying over and over again, which reduces the amount of work TimescaleDB needs to do every time we run the query. (Continuous aggregates are like PostgreSQL materialized views. For more info, check out our &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=caggs" rel="noopener noreferrer"&gt;continuous aggregates docs&lt;/a&gt;.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;player_yards_by_game_&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timescaledb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;yards&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tracking&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this query and creating a continuous aggregate, we can modify that first query just slightly, using the continuous aggregate as our base table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;total_position_yards&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;position_yards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;player_yards_by_game&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;position_yards&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;game_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;total_position_yards&lt;/span&gt; &lt;span class="n"&gt;tpy&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;game_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tpy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'QB'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'RB'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'WR'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'TE'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;game_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;game_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We get the same result, but now the query runs in about 100 ms - &lt;strong&gt;80x faster!&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Advanced SQL data analysis with TimescaleDB hyperfunctions
&lt;/h2&gt;

&lt;p&gt;Finally, the more we dug into the data, the more we found we needed (or wanted) functions specifically tuned for time-series data analysis to answer the types of questions we wanted to ask.&lt;/p&gt;

&lt;p&gt;It is for this kind of analysis that we built &lt;a href="https://blog.timescale.com/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=hyperfunctions-blog" rel="noopener noreferrer"&gt;TimescaleDB hyperfunctions&lt;/a&gt;, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grouping data into percentiles
&lt;/h3&gt;

&lt;p&gt;The NFL dataset is a great use case for &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=hyperfunctions-percentile" rel="noopener noreferrer"&gt;percentiles&lt;/a&gt;. Being able to quickly find players who perform better or worse than their cohort is really powerful.&lt;/p&gt;

&lt;p&gt;As an example, we'll use the same continuous aggregate we created earlier (total yards, per game, per player) to find the median total yards traveled by position for each game.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;sum_yards&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;--Add position to the table to allow for grouping by it later&lt;/span&gt;
 &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;player_yards_by_game&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
 &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;player&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;--Find the mean and median for each position type&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mean_yards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;median_yards&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sum_yards&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;mean_yards&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;position&lt;/th&gt;
&lt;th&gt;mean_yards&lt;/th&gt;
&lt;th&gt;median_yards&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FS&lt;/td&gt;
&lt;td&gt;595.583433048431&lt;/td&gt;
&lt;td&gt;626.388099960848&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CB&lt;/td&gt;
&lt;td&gt;572.3336749867212&lt;/td&gt;
&lt;td&gt;592.2175990890378&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;552.6508570179277&lt;/td&gt;
&lt;td&gt;555.5030569048633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S&lt;/td&gt;
&lt;td&gt;530.6436781609186&lt;/td&gt;
&lt;td&gt;550.5961518474892&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SS&lt;/td&gt;
&lt;td&gt;522.5604103343453&lt;/td&gt;
&lt;td&gt;551.1296628916651&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLB&lt;/td&gt;
&lt;td&gt;462.70229007633407&lt;/td&gt;
&lt;td&gt;490.77906906009343&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ILB&lt;/td&gt;
&lt;td&gt;402.7882871125599&lt;/td&gt;
&lt;td&gt;403.3779668359464&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OLB&lt;/td&gt;
&lt;td&gt;393.40014271151847&lt;/td&gt;
&lt;td&gt;390.6742117791442&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QB&lt;/td&gt;
&lt;td&gt;334.7025466893028&lt;/td&gt;
&lt;td&gt;352.1192705472368&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LB&lt;/td&gt;
&lt;td&gt;328.9812527472519&lt;/td&gt;
&lt;td&gt;257.72003396053884&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;327.9515596330271&lt;/td&gt;
&lt;td&gt;257.72003396053884&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
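&lt;p&gt;Outside the database, the same mean-versus-median comparison can be sketched with Python's &lt;code&gt;statistics&lt;/code&gt; module. This toy version uses made-up yardage numbers and exact calculations, where the hyperfunctions compute an approximation over the full dataset:&lt;/p&gt;

```python
import statistics

# Hypothetical per-game yardage totals for two positions (made-up numbers,
# standing in for the continuous aggregate's SUM(dis) AS yards).
yards_by_position = {
    "WR": [540.2, 561.7, 555.5, 590.1, 515.3],
    "QB": [330.4, 352.1, 310.9, 360.0, 340.2],
}

for position, yards in yards_by_position.items():
    mean_yards = statistics.mean(yards)      # analogous to mean(percentile_agg(yards))
    median_yards = statistics.median(yards)  # analogous to approx_percentile(0.5, percentile_agg(yards))
    print(f"{position}: mean={mean_yards:.1f}, median={median_yards:.1f}")
```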

&lt;h3&gt;
  
  
  Finding extreme outliers
&lt;/h3&gt;

&lt;p&gt;Finally, we can build upon this percentile query to find players at each position who run farther than 95% of all other players at that position. For some positions, like wide receiver or free safety, this could help us find the “outlier” players who are able to travel the field consistently throughout a game – and make plays!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;sum_yards&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;--Add position to the table to allow for grouping by it later&lt;/span&gt;
 &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;player_yards_by_game&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
 &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;player&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;position_percentile&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p95&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sum_yards&lt;/span&gt; 
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p95&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sum_yards&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;position_percentile&lt;/span&gt; &lt;span class="n"&gt;pp&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;POSITION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;yards&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;p95&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;POSITION&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'WR'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'FS'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'QB'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'TE'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;position&lt;/th&gt;
&lt;th&gt;display_name&lt;/th&gt;
&lt;th&gt;yards&lt;/th&gt;
&lt;th&gt;p95&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FS&lt;/td&gt;
&lt;td&gt;Eric Weddle&lt;/td&gt;
&lt;td&gt;13869.76&lt;/td&gt;
&lt;td&gt;12320.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FS&lt;/td&gt;
&lt;td&gt;Adrian Amos&lt;/td&gt;
&lt;td&gt;12989.44&lt;/td&gt;
&lt;td&gt;12320.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FS&lt;/td&gt;
&lt;td&gt;Tyrann Mathieu&lt;/td&gt;
&lt;td&gt;12565.22&lt;/td&gt;
&lt;td&gt;12320.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QB&lt;/td&gt;
&lt;td&gt;Aaron Rodgers&lt;/td&gt;
&lt;td&gt;7422.36&lt;/td&gt;
&lt;td&gt;6667.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QB&lt;/td&gt;
&lt;td&gt;Patrick Mahomes&lt;/td&gt;
&lt;td&gt;6985.99&lt;/td&gt;
&lt;td&gt;6667.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QB&lt;/td&gt;
&lt;td&gt;Matt Ryan&lt;/td&gt;
&lt;td&gt;6759.96&lt;/td&gt;
&lt;td&gt;6667.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;Zach Ertz&lt;/td&gt;
&lt;td&gt;13124.59&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;Jimmy Graham&lt;/td&gt;
&lt;td&gt;12693.68&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;Travis Kelce&lt;/td&gt;
&lt;td&gt;12218.13&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;David Njoku&lt;/td&gt;
&lt;td&gt;11502.16&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;George Kittle&lt;/td&gt;
&lt;td&gt;11058.10&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;Kyle Rudolph&lt;/td&gt;
&lt;td&gt;10761.95&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;Jared Cook&lt;/td&gt;
&lt;td&gt;10678.23&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Antonio Brown&lt;/td&gt;
&lt;td&gt;16877.56&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Brandin Cooks&lt;/td&gt;
&lt;td&gt;15510.02&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;JuJu Smith-Schuster&lt;/td&gt;
&lt;td&gt;15492.77&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Robert Woods&lt;/td&gt;
&lt;td&gt;15253.18&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Nelson Agholor&lt;/td&gt;
&lt;td&gt;15180.33&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Tyreek Hill&lt;/td&gt;
&lt;td&gt;15106.61&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Zay Jones&lt;/td&gt;
&lt;td&gt;14790.59&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Sterling Shepard&lt;/td&gt;
&lt;td&gt;14673.80&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Mike Evans&lt;/td&gt;
&lt;td&gt;14620.13&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Davante Adams&lt;/td&gt;
&lt;td&gt;14574.51&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Kenny Golladay&lt;/td&gt;
&lt;td&gt;14354.50&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Jarvis Landry&lt;/td&gt;
&lt;td&gt;14281.51&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
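The table lists, for each position, the players whose total yards exceed that position's 95th percentile (the repeated p95 column is the per-position threshold). The original analysis computes this in SQL; as a rough illustration of the same logic, here is a minimal Python sketch. The player names and yardage figures in it are placeholders, not NFL data, and `statistics.quantiles` is only an approximation of what PostgreSQL's `percentile_cont(0.95)` would return:

```python
import statistics
from collections import defaultdict

# Illustrative stand-in for the per-player totals table:
# (position, display_name, yards). Made-up values, not NFL data.
rows = [
    ("QB", "Player A", 100.0),
    ("QB", "Player B", 200.0),
    ("QB", "Player C", 300.0),
    ("QB", "Player D", 400.0),
    ("QB", "Player E", 500.0),
    ("TE", "Player F", 50.0),
    ("TE", "Player G", 60.0),
    ("TE", "Player H", 70.0),
    ("TE", "Player I", 80.0),
    ("TE", "Player J", 90.0),
]

# Group yardage totals by position.
by_pos = defaultdict(list)
for pos, _, yards in rows:
    by_pos[pos].append(yards)

# 95th percentile per position: with n=20, the last cut point is p95.
# method="inclusive" interpolates linearly over the sample, roughly
# what percentile_cont(0.95) does in SQL.
p95 = {
    pos: statistics.quantiles(vals, n=20, method="inclusive")[-1]
    for pos, vals in by_pos.items()
}

# Keep only players above their position's threshold, as in the table.
leaders = [
    (pos, name, yards, p95[pos])
    for pos, name, yards in rows
    if yards > p95[pos]
]
for row in leaders:
    print(row)
```

In SQL this corresponds to something like `percentile_cont(0.95) WITHIN GROUP (ORDER BY yards)` grouped by position (or one of TimescaleDB's percentile aggregates), joined back against the per-player totals; the Python version is only meant to make the table's logic concrete.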




&lt;h2&gt;
  
  
  Where can the data take you?
&lt;/h2&gt;

&lt;p&gt;As you’ve seen in this example, &lt;strong&gt;time-series data is everywhere&lt;/strong&gt;. Being able to harness it gives you a huge advantage, whether you’re working on a professional solution or a personal project.&lt;/p&gt;

&lt;p&gt;We’ve shown you a few ways that time-series queries can unlock interesting insights, give you a greater appreciation for the game and its players, and (hopefully) inspire you to dig into the data yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To get started with the &lt;a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview" rel="noopener noreferrer"&gt;NFL data&lt;/a&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spin up a fully managed TimescaleDB service:&lt;/strong&gt; create an account to &lt;a href="https://console.forge.timescale.com/signup/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=forge-signup" rel="noopener noreferrer"&gt;try it for free&lt;/a&gt; for 30 days.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=nfl-tutorial" rel="noopener noreferrer"&gt;Follow our complete tutorial&lt;/a&gt; for step-by-step instructions for preparing and ingesting the dataset, along with several more queries to help you glean insights from the dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re new to time-series data or just have some questions about how to use TimescaleDB to analyze the NFL’s dataset, &lt;a href="https://slack.timescale.com/" rel="noopener noreferrer"&gt;join our public Slack community&lt;/a&gt;. You’ll find Timescale engineers and thousands of time-series enthusiasts from around the world, and we’ll be happy to help you.&lt;/p&gt;

&lt;p&gt;🙏 We’d like to thank the NFL for making this data available, and the millions of passionate fans around the world who make the NFL such an exciting game to watch.&lt;/p&gt;

&lt;p&gt;And, Geaux Saints 🏈!&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://blog.timescale.com/blog/hacking-nfl-data-with-postgresql-timescaledb-and-sql/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=nfl-blog-post" rel="noopener noreferrer"&gt;original blog post&lt;/a&gt; was a collaboration between Attila Toth, Miranda Auhl, and Ryan Booz.&lt;/p&gt;

</description>
      <category>database</category>
      <category>analytics</category>
      <category>postgres</category>
      <category>timeseries</category>
    </item>
  </channel>
</rss>
