<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: mirandaauhl</title>
    <description>The latest articles on DEV Community by mirandaauhl (@mirandaauhl).</description>
    <link>https://dev.to/mirandaauhl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F675083%2Fd31e976f-8654-4f52-8bf4-eee064fc0876.jpeg</url>
      <title>DEV Community: mirandaauhl</title>
      <link>https://dev.to/mirandaauhl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mirandaauhl"/>
    <language>en</language>
    <item>
      <title>PostgreSQL vs Python for data cleaning: A guide</title>
      <dc:creator>mirandaauhl</dc:creator>
      <pubDate>Wed, 08 Dec 2021 22:34:22 +0000</pubDate>
      <link>https://dev.to/tigerdata/postgresql-vs-python-for-data-cleaning-a-guide-3o5d</link>
      <guid>https://dev.to/tigerdata/postgresql-vs-python-for-data-cleaning-a-guide-3o5d</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;During analysis, you rarely - if ever - get to go directly from evaluating data to transforming and analyzing it. Sometimes, to properly evaluate your data, you may need to do some pre-cleaning before the main data cleaning - and that’s a lot of cleaning! To accomplish all this work, you may use Excel, R, or Python, but are these really the best tools for data cleaning tasks?&lt;/p&gt;

&lt;p&gt;In this blog post, I explore some classic &lt;strong&gt;data cleaning&lt;/strong&gt; scenarios and show how you can perform them &lt;em&gt;directly within your database&lt;/em&gt; using &lt;a href="https://www.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-website" rel="noopener noreferrer"&gt;TimescaleDB&lt;/a&gt; and &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt;, replacing the tasks that you may have done in Excel, R, or Python. TimescaleDB and PostgreSQL cannot replace these tools entirely, but they can help your data munging/cleaning tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.  &lt;/p&gt;

&lt;p&gt;Cleaning is a very important part of the analysis process and, in my experience, generally the most grueling! By cleaning data directly within my database, I can perform many of my cleaning tasks once rather than repeatedly within a script, saving me considerable time in the long run.&lt;/p&gt;

&lt;h1&gt;
  
  
  A recap of the data analysis process
&lt;/h1&gt;

&lt;p&gt;I began this series of posts on &lt;a href="https://blog.timescale.com/blog/speeding-up-data-analysis/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=first-post" rel="noopener noreferrer"&gt;data analysis&lt;/a&gt; by presenting the following summary of the analysis process:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8v6xr0dvn0i2brdxfzv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8v6xr0dvn0i2brdxfzv.jpeg" alt="Data Analysis Lifecycle" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first three steps of the analysis lifecycle (evaluate, clean, transform) comprise the “data munging” stages of analysis. Historically, I have done my data munging and modeling all within Python or R, both excellent options for analysis. However, once I was introduced to PostgreSQL and TimescaleDB, I discovered how efficient and fast it could be to do my data munging directly within my database. In my previous post, I focused on showing &lt;a href="https://blog.timescale.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=previous-post" rel="noopener noreferrer"&gt;data evaluation&lt;/a&gt; techniques and how you can replace tasks previously done in Python with PostgreSQL and TimescaleDB code. I now want to move on to the second step, &lt;strong&gt;data cleaning&lt;/strong&gt;. Cleaning may not be the most glamorous step in the analysis process, but it is absolutely crucial to creating accurate and meaningful models.&lt;/p&gt;

&lt;p&gt;As I mentioned &lt;a href="https://blog.timescale.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=previous-post" rel="noopener noreferrer"&gt;in my last post&lt;/a&gt;, my first job out of college was at an energy and sustainability solutions company that focused on monitoring all different kinds of utility usage - such as electricity, water, sewage, you name it - to figure out how our clients’ buildings could be more efficient. My role at this company was to perform data analysis and business intelligence tasks.&lt;/p&gt;

&lt;p&gt;Throughout my time in this job, I got the chance to use many popular data analysis tools including Excel, R, and Python. But once I tried using a database to perform my data munging tasks - specifically PostgreSQL and TimescaleDB - I realized how efficient and straightforward analysis, and particularly cleaning tasks, could be when done directly in a database. &lt;/p&gt;

&lt;p&gt;Before using a database for data cleaning tasks, I would often find either columns or values that needed to be edited. I would pull the raw data from a CSV file or database, then make any adjustments to this data within my Python script. This meant that every time I ran my Python script, my machine spent computational time setting up and cleaning my data, so I lost time with every run. Additionally, if I wanted to share cleaned data with colleagues, I would have to run the script for them or pass it along for them to run. This extra computational time could add up depending on the project. &lt;/p&gt;

&lt;p&gt;Instead, with PostgreSQL, I can write a query to do this cleaning once and then store the results in a table. I wouldn’t need to spend time cleaning and transforming data again and again with a Python script; I could just set up the cleaning process in my database and call it a day! Once I started to make cleaning changes directly within my database, I was able to skip performing cleaning tasks within Python and jump straight into modeling my data. &lt;/p&gt;

&lt;p&gt;To keep this post as succinct as possible, I chose to only show side-by-side code comparisons for Python and PostgreSQL. If you have any questions about other tools or languages, please feel free to join our &lt;a href="https://slack.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=slack" rel="noopener noreferrer"&gt;Slack channel&lt;/a&gt;, where you can ask the Timescale community, or me, specific questions about Timescale or PostgreSQL functionality 😊. I’d love to hear from you!&lt;/p&gt;

&lt;p&gt;Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away! Which is awesome! The easiest way to get started is by signing up for &lt;a href="https://www.timescale.com/timescale-signup/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-signup" rel="noopener noreferrer"&gt;a free 30-day trial&lt;/a&gt; of Timescale Cloud (if you prefer self-hosting, you can always &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-self-hosted" rel="noopener noreferrer"&gt;install and manage&lt;/a&gt; TimescaleDB on your own PostgreSQL instances). Learn more by &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-tutorials" rel="noopener noreferrer"&gt;following one of our many tutorials&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, before we dive into things and get our data, as Outkast best put it, “So fresh, So clean”, I want to quickly cover the data set I will be using. I also want to note that all the code I show assumes you have some basic knowledge of SQL. If you are not familiar with SQL, don’t worry! In my last post, I included a section on SQL basics, which you can find &lt;a href="https://blog.timescale.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=previous-post-sql-basics/#sql-basics/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  About the sample dataset
&lt;/h1&gt;

&lt;p&gt;In my experience within the data science realm, I have done the majority of my data cleaning after evaluation. However, sometimes it can be beneficial to clean data, evaluate, and then clean again. The process you choose depends on the initial state of your data and how easy it is to evaluate. For the data set I will use today, I would likely do some initial cleaning before evaluation and then clean again after, and I will show you why. &lt;/p&gt;

&lt;p&gt;I got the following &lt;a href="https://www.kaggle.com/jaganadhg/house-hold-energy-data" rel="noopener noreferrer"&gt;IoT data set from Kaggle&lt;/a&gt;, where a very generous individual shared the energy consumption readings from their apartment in San Jose, CA, recorded in 15-minute increments. While this is awesome data, it is structured a little differently than I would like. The raw data set follows this schema:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yw6mj7lg01oykgddtl7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yw6mj7lg01oykgddtl7.jpg" alt="energy_usage_staging table" width="800" height="636"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and appears like this…&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;start_time&lt;/th&gt;
&lt;th&gt;end_time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:00:00&lt;/td&gt;
&lt;td&gt;00:14:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:15:00&lt;/td&gt;
&lt;td&gt;00:29:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:30:00&lt;/td&gt;
&lt;td&gt;00:44:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:45:00&lt;/td&gt;
&lt;td&gt;00:59:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;01:00:00&lt;/td&gt;
&lt;td&gt;01:14:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;01:15:00&lt;/td&gt;
&lt;td&gt;01:29:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;01:30:00&lt;/td&gt;
&lt;td&gt;01:44:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;01:45:00&lt;/td&gt;
&lt;td&gt;01:59:00&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In order to do any type of analysis on this data set, I want to clean it up. A few things that quickly come to mind include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The cost column is stored as a text data type, which will cause some issues.&lt;/li&gt;
&lt;li&gt;The time columns are split apart, which could cause problems if I want to create plots over time or perform any type of modeling based on time.&lt;/li&gt;
&lt;li&gt;I may also want to filter the data based on various parameters that have to do with time, such as day of the week or holiday identification (both potentially play into how energy is used within the household). &lt;/li&gt;
&lt;/ul&gt;
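&lt;p&gt;To make the first two issues concrete, here is a minimal pandas sketch (the dataframe below is illustrative, not the full Kaggle file) showing why a text-typed cost column gets in the way of even basic numeric work:&lt;/p&gt;

```python
import pandas as pd

# Illustrative rows mimicking the raw schema, not the actual Kaggle file
raw = pd.DataFrame({
    "type": ["Electric usage", "Electric usage"],
    "date": ["2016-10-22", "2016-10-22"],
    "start_time": ["00:00:00", "00:15:00"],
    "usage": [0.01, 0.01],
    "units": ["kWh", "kWh"],
    "cost": ["$0.00", "$0.00"],
})

# cost is a plain object (text) column, so numeric work fails outright
print(raw["cost"].dtype)  # object
try:
    raw["cost"].sum() / 2  # string concatenation, then TypeError on the divide
except TypeError as err:
    print("cannot average text costs:", err)
```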

&lt;p&gt;To fix all of these issues and make data evaluation and analysis more valuable, I will have to clean the incoming data! So without further ado, let’s roll up our sleeves and dig in!&lt;/p&gt;

&lt;h1&gt;
  
  
  Cleaning the data
&lt;/h1&gt;

&lt;p&gt;I will show most of the techniques I have used in the past while working in data science. While these examples are not exhaustive, I hope they will cover many of the cleaning steps you perform during your own analysis, helping to make your cleaning tasks more efficient by using PostgreSQL and TimescaleDB.&lt;/p&gt;

&lt;p&gt;Please feel free to explore these various techniques and skip around if you need! There is a lot here, and I designed it to be a helpful glossary of tools that you could use as you need.&lt;/p&gt;

&lt;p&gt;The techniques that I will cover include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Correcting structural issues&lt;/li&gt;
&lt;li&gt;Creating or generating relevant data&lt;/li&gt;
&lt;li&gt;Adding data to a hypertable&lt;/li&gt;
&lt;li&gt;Renaming columns or tables&lt;/li&gt;
&lt;li&gt;Filling in missing values&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A note on my cleaning approach
&lt;/h3&gt;

&lt;p&gt;There are many ways that I could approach the cleaning process in PostgreSQL. I could create a table and then &lt;a href="https://www.postgresql.org/docs/current/sql-altertable.html" rel="noopener noreferrer"&gt;&lt;code&gt;ALTER&lt;/code&gt;&lt;/a&gt; it as I clean, create multiple tables as I add or change data, or work with &lt;a href="https://www.postgresql.org/docs/14/sql-createview.html" rel="noopener noreferrer"&gt;&lt;code&gt;VIEW&lt;/code&gt;s&lt;/a&gt;. Depending on the size of my data, any of these approaches &lt;em&gt;could&lt;/em&gt; make sense; however, they have different computational consequences.&lt;/p&gt;

&lt;p&gt;You may have noticed above that my raw data table was called &lt;code&gt;energy_usage_staging&lt;/code&gt;. This is because I decided that given the state of my raw data, it would be best for me to place the raw data in a &lt;em&gt;staging table&lt;/em&gt;, clean it using &lt;code&gt;VIEW&lt;/code&gt;s, then insert it into a more usable table as part of my cleaning process. This move from raw table to the usable table could happen even before the evaluation step of analysis. As I discussed above, sometimes data cleaning has to occur after AND before evaluating your data. Regardless, this data needs to be cleaned and I wanted to use the most efficient method possible. In this case, that meant using a staging table and leveraging the efficiency and power of PostgreSQL &lt;code&gt;VIEW&lt;/code&gt;s, something I will talk about later.&lt;/p&gt;

&lt;p&gt;Generally, if you are dealing with a lot of data, altering an existing table in PostgreSQL can be costly. For this post, I will show you how to build up clean data using &lt;code&gt;VIEW&lt;/code&gt;s along with additional tables. This method of cleaning is more efficient and sets you up for the next blog post about data transformation which includes the use of scripts in PostgreSQL.&lt;/p&gt;


&lt;h2&gt;
  
  
  Correcting structural issues
&lt;/h2&gt;

&lt;p&gt;Right off the bat, I know that I need to do some data refactoring on my raw table due to data types. Notice that the &lt;code&gt;date&lt;/code&gt; and time columns are separated, and &lt;code&gt;cost&lt;/code&gt; is recorded as a text data type. I need to convert the separated date and time columns to a single timestamp and the &lt;code&gt;cost&lt;/code&gt; column to float4. But before I show that, I want to talk about why conversion to timestamp is beneficial.&lt;/p&gt;

&lt;h3&gt;
  
  
  TimescaleDB hypertables and why timestamp is important
&lt;/h3&gt;

&lt;p&gt;For those of you not familiar with the structure of &lt;a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=hypertables-chunks" rel="noopener noreferrer"&gt;TimescaleDB hypertables&lt;/a&gt;, they are the basis of how TimescaleDB efficiently queries and manipulates time-series data. Hypertables are partitioned based on time - specifically, by the time column you specify when you create the table.&lt;/p&gt;

&lt;p&gt;The data is partitioned by timestamp into "chunks" so that every row in the table belongs to some &lt;em&gt;chunk&lt;/em&gt; based on a time range. Queries that filter on time can then touch only the relevant chunks, which makes time-based querying and data manipulation more efficient. This image represents the difference between a normal table and our special hypertables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvw87tb0201nsg318udh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvw87tb0201nsg318udh.jpg" alt="Hypertables example" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;
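&lt;p&gt;To illustrate the idea of time partitioning (this is just a toy sketch, not how TimescaleDB actually implements chunks), you can think of every timestamp as mapping to the fixed time range, or chunk, that contains it:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Toy illustration of time partitioning: each row lands in a chunk
# identified by the start of its 7-day time range.
CHUNK_INTERVAL = timedelta(days=7)
EPOCH = datetime(2016, 1, 1)

def chunk_for(ts):
    """Return the start of the 7-day chunk that a timestamp belongs to."""
    offset = ts - EPOCH
    chunk_index = offset // CHUNK_INTERVAL  # whole chunks since the epoch
    return EPOCH + chunk_index * CHUNK_INTERVAL

rows = [datetime(2016, 10, 22, 0, 0), datetime(2016, 10, 22, 0, 15),
        datetime(2016, 11, 5, 12, 0)]
for ts in rows:
    print(ts, "lands in chunk starting", chunk_for(ts))
```

A query constrained to one day in October would then only need to look inside the single chunk covering that range, rather than scanning the whole table.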

&lt;h3&gt;
  
  
  Changing date-time structure
&lt;/h3&gt;

&lt;p&gt;Because I want to utilize TimescaleDB functionality to the fullest, such as continuous aggregates and faster time-based queries, I want to restructure the &lt;code&gt;energy_usage_staging&lt;/code&gt; table's &lt;code&gt;date&lt;/code&gt; and time columns. I could use the &lt;code&gt;date&lt;/code&gt; column for my hypertable partitioning; however, I would have limited control over manipulating my data based on time. It is more flexible and space-efficient to have a single column with a timestamp than it is to have separate columns with date and time. I can always extract the date or time from the timestamp if I want to later!  &lt;/p&gt;

&lt;p&gt;Looking back at the table structure, I should be able to get a usable timestamp value from the &lt;code&gt;date&lt;/code&gt; and &lt;code&gt;start_time&lt;/code&gt; columns, since the &lt;code&gt;end_time&lt;/code&gt; really doesn’t give me much useful information. Thus, I want to combine these two columns to form a new timestamp column. Let’s see how I can do that using SQL. Spoiler alert: it is as simple as an algebraic statement. How cool is that?!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
In PostgreSQL, I can compute the new column without inserting it into the database just yet. Since I want to create a NEW table from this staging one, I don’t want to add more columns or tables prematurely.&lt;/p&gt;

&lt;p&gt;Let’s first compare the original columns with our new generated column. For this query I simply &lt;em&gt;add&lt;/em&gt; the two columns together. The &lt;code&gt;AS&lt;/code&gt; keyword just allows me to rename the column to whatever I would like, in this case &lt;code&gt;time&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--add the date column to the start_time column&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage_staging&lt;/span&gt; &lt;span class="n"&gt;eus&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;start_time&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:00:00&lt;/td&gt;
&lt;td&gt;2016-10-22 00:00:00.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:15:00&lt;/td&gt;
&lt;td&gt;2016-10-22 00:15:00.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:30:00&lt;/td&gt;
&lt;td&gt;2016-10-22 00:30:00.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;00:45:00&lt;/td&gt;
&lt;td&gt;2016-10-22 00:45:00.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;01:00:00&lt;/td&gt;
&lt;td&gt;2016-10-22 01:00:00.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-22&lt;/td&gt;
&lt;td&gt;01:15:00&lt;/td&gt;
&lt;td&gt;2016-10-22 01:15:00.000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
In Python, the easiest way to do this is to add a new column to the dataframe. Notice that in Python I have to concatenate the two text columns with a space between them, then convert the result to datetime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
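&lt;p&gt;The snippet above assumes &lt;code&gt;energy_stage_df&lt;/code&gt; is already loaded from the raw data. Here is a self-contained sketch with a couple of illustrative rows, showing the concatenate-then-parse step end to end:&lt;/p&gt;

```python
import pandas as pd

# Illustrative stand-in for energy_stage_df loaded from the raw CSV
energy_stage_df = pd.DataFrame({
    "date": ["2016-10-22", "2016-10-22"],
    "start_time": ["00:00:00", "00:15:00"],
})

# Concatenate the text columns with a space, then parse into a datetime64 column
energy_stage_df["time"] = pd.to_datetime(
    energy_stage_df["date"] + " " + energy_stage_df["start_time"]
)
print(energy_stage_df[["date", "start_time", "time"]])
```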



&lt;h3&gt;
  
  
  Changing column data types
&lt;/h3&gt;

&lt;p&gt;Next, I want to change the data type of my cost column from text to float. Again, this is straightforward in PostgreSQL with the &lt;a href="https://www.postgresql.org/docs/14/functions-formatting.html" rel="noopener noreferrer"&gt;&lt;code&gt;TO_NUMBER()&lt;/code&gt;&lt;/a&gt; function. &lt;/p&gt;

&lt;p&gt;The format of the function is as follows: &lt;code&gt;TO_NUMBER(‘text’, ‘format’)&lt;/code&gt;. The ‘format’ input is a PostgreSQL-specific string that you can build depending on what type of text you want to convert. In our case, we have a &lt;code&gt;$&lt;/code&gt; symbol followed by a numeric value such as &lt;code&gt;0.00&lt;/code&gt;. For the format string I used ‘L9G999D99’: the L tells PostgreSQL there is a currency symbol at the beginning of the text, the 9s mark numeric digits, the G marks a group (thousands) separator, and the D stands for the decimal point.&lt;/p&gt;

&lt;p&gt;This format comfortably covers the cost column, which has no values greater than $0.65. If you were planning to convert a column with larger numeric values, you would add more digits and G separators to account for them. For example, if you had a cost column with text values like ‘$1,672,278.23’, you would format the string like this: ‘L9G999G999D99’.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--create a new column called cost_new with the to_number() function&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TO_NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'L9G999D99'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cost_new&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage_staging&lt;/span&gt; &lt;span class="n"&gt;eus&lt;/span&gt;  
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;cost_new&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;cost_new&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;$0.65&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.65&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.65&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.57&lt;/td&gt;
&lt;td&gt;0.57&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.46&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.46&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.46&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$0.46&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
For Python, I used a lambda function that replaces each ‘$’ sign with an empty string. Because &lt;code&gt;apply()&lt;/code&gt; processes the column row by row, this can be fairly inefficient on large data sets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost_new&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost_new&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
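&lt;p&gt;Again, the snippet assumes &lt;code&gt;energy_stage_df&lt;/code&gt; already exists. Here is a self-contained sketch with illustrative values, including a vectorized &lt;code&gt;str.replace&lt;/code&gt; alternative that is usually preferred over a row-by-row lambda in pandas:&lt;/p&gt;

```python
import pandas as pd

# Illustrative stand-in for the staging dataframe
energy_stage_df = pd.DataFrame({"cost": ["$0.00", "$0.65", "$0.46"]})

# Row-by-row lambda, as in the snippet above
energy_stage_df["cost_new"] = pd.to_numeric(
    energy_stage_df.cost.apply(lambda x: x.replace("$", ""))
)

# Vectorized string method: same result, more idiomatic pandas
energy_stage_df["cost_vec"] = pd.to_numeric(
    energy_stage_df["cost"].str.replace("$", "", regex=False)
)
print(energy_stage_df)
```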



&lt;h3&gt;
  
  
  Creating a &lt;code&gt;VIEW&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Now that I know how to convert my columns, I can combine the two queries and create a &lt;code&gt;VIEW&lt;/code&gt; of my new restructured table. A &lt;a href="https://www.postgresql.org/docs/14/sql-createview.html" rel="noopener noreferrer"&gt;&lt;code&gt;VIEW&lt;/code&gt;&lt;/a&gt; is a PostgreSQL object that allows you to define a query and call it by its name, as if it were a table within your database. I can use the following query to generate the data I want, and then create a &lt;code&gt;VIEW&lt;/code&gt; from it that I can query as if it were a table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- query the right data that I want&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="nv"&gt;"usage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;TO_NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'L9G999D99'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;notes&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage_staging&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:30:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:45:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 02:00:00.000&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 02:15:00.000&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I decided to call my &lt;code&gt;VIEW&lt;/code&gt; &lt;code&gt;energy_view&lt;/code&gt;. Now, when I want to do further cleaning, I can just specify its name in the &lt;code&gt;FROM&lt;/code&gt; clause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--create view from the query above&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="nv"&gt;"usage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;TO_NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'L9G999D99'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;notes&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage_staging&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;energy_stage_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;units&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost_new&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost_new&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is important to note that with PostgreSQL &lt;code&gt;VIEW&lt;/code&gt;s, the underlying query has to be recomputed every time you query the &lt;code&gt;VIEW&lt;/code&gt;. This is why we want to insert our &lt;code&gt;VIEW&lt;/code&gt; data into a hypertable once we have the data set up just right. You can think of &lt;code&gt;VIEW&lt;/code&gt;s as a shorthand version of the &lt;a href="https://blog.timescale.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-tutorials/#cte/" rel="noopener noreferrer"&gt;CTE &lt;code&gt;WITH ... AS&lt;/code&gt;&lt;/a&gt; statement I discussed in my last post.&lt;/p&gt;
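&lt;p&gt;To make that recomputation trade-off concrete, here is a minimal sketch using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module (not PostgreSQL or TimescaleDB; the table and column names are illustrative): a &lt;code&gt;VIEW&lt;/code&gt; re-runs its defining query on every read, while a table created from the &lt;code&gt;VIEW&lt;/code&gt; stores the results once.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE energy_usage_staging (usage REAL, cost TEXT)")
conn.execute("INSERT INTO energy_usage_staging VALUES (0.01, '$0.00'), (0.02, '$0.05')")

# A view is just a stored query: it is re-evaluated on every SELECT
conn.execute("""
    CREATE VIEW energy_view AS
    SELECT usage, CAST(REPLACE(cost, '$', '') AS REAL) AS cost
    FROM energy_usage_staging
""")

# Materializing the view into a plain table computes the results once,
# roughly what inserting the VIEW data into a hypertable achieves
conn.execute("CREATE TABLE energy_data AS SELECT * FROM energy_view")

print(conn.execute("SELECT * FROM energy_data").fetchall())
# → [(0.01, 0.0), (0.02, 0.05)]
```

In PostgreSQL, the analogous materialization would be a &lt;code&gt;CREATE TABLE ... AS SELECT * FROM energy_view&lt;/code&gt; (or a &lt;code&gt;MATERIALIZED VIEW&lt;/code&gt;).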

&lt;p&gt;We are now one step closer to cleaner data!&lt;/p&gt;


&lt;h2&gt;
  
  
  Creating or generating relevant data
&lt;/h2&gt;

&lt;p&gt;With some quick investigation, we can see that the notes column is blank for this data set. To check this, I just need to include a &lt;code&gt;WHERE&lt;/code&gt; clause that filters for rows where &lt;code&gt;notes&lt;/code&gt; is not equal to an empty string. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="n"&gt;ew&lt;/span&gt;
&lt;span class="c1"&gt;-- where notes are not equal to an empty string&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The results come back empty, confirming that the notes column contains no data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;notnull&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the notes are blank, I would like to replace the column with additional information that I could use later on during modeling. In particular, I would like to add a column that specifies the day of the week. To do this, I can use the &lt;code&gt;EXTRACT()&lt;/code&gt; command. The &lt;a href="https://www.postgresql.org/docs/14/functions-datetime.html" rel="noopener noreferrer"&gt;&lt;code&gt;EXTRACT()&lt;/code&gt;&lt;/a&gt; command is a PostgreSQL date/time function that pulls individual date/time elements out of a timestamp. For our column, PostgreSQL has the field DOW (day-of-week), which maps Sunday (0) through Saturday (6).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--extract day-of-week from date column and cast the output to an int&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="n"&gt;ew&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
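&lt;p&gt;The two numbering conventions are easy to mix up: PostgreSQL’s DOW runs Sunday (0) through Saturday (6), while pandas’ &lt;code&gt;dt.dayofweek&lt;/code&gt; (like Python’s &lt;code&gt;datetime.weekday()&lt;/code&gt;) runs Monday (0) through Sunday (6). A quick sketch of the mapping, using the first date in the data set (2016-10-22, a Saturday):&lt;/p&gt;

```python
from datetime import datetime

ts = datetime(2016, 10, 22)  # a Saturday

pandas_style = ts.weekday()            # Monday=0 .. Sunday=6
postgres_dow = (ts.weekday() + 1) % 7  # Sunday=0 .. Saturday=6

print(pandas_style, postgres_dow)  # → 5 6
```

This is why the table above shows &lt;code&gt;day_of_week&lt;/code&gt; = 6 for 2016-10-22, while the pandas column would hold 5 for the same rows.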



&lt;p&gt;Additionally, we may want to add another column that specifies whether a day falls on a weekend or a weekday. I will do this by creating a boolean column, where &lt;code&gt;true&lt;/code&gt; represents a weekend, and &lt;code&gt;false&lt;/code&gt; represents a weekday. To do this, I will apply a &lt;a href="https://www.postgresql.org/docs/14/plpgsql-control-structures.html" rel="noopener noreferrer"&gt;&lt;code&gt;CASE&lt;/code&gt;&lt;/a&gt; statement. With this command I can write “when-then” logic (similar to “if-then” statements in most programming languages): &lt;code&gt;WHEN&lt;/code&gt; a &lt;code&gt;day_of_week&lt;/code&gt; value is &lt;code&gt;IN&lt;/code&gt; the set (0,6) &lt;code&gt;THEN&lt;/code&gt; the output should be &lt;code&gt;true&lt;/code&gt;, &lt;code&gt;ELSE&lt;/code&gt; the value should be &lt;code&gt;false&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="c1"&gt;--use the case statement to make a column true when records fall on a weekend aka 0 and 6&lt;/span&gt;
&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="n"&gt;ew&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;is_weekend&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Fun fact: you can get the same result without a &lt;code&gt;CASE&lt;/code&gt; statement; however, this shortcut only works for boolean columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--another method to create a binary column&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="n"&gt;ew&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
Notice that in Python, weekend days are represented by the numbers 5 and 6 (Saturday and Sunday), versus the PostgreSQL weekend values 0 and 6 (Sunday and Saturday).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_weekend&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
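&lt;p&gt;Since the membership test itself already yields a boolean, the &lt;code&gt;np.where()&lt;/code&gt; step (which produces 1/0 integers) is optional; &lt;code&gt;energy_df['day_of_week'].isin([5, 6])&lt;/code&gt; alone would give a boolean column closer to the PostgreSQL output. The same test sketched in plain Python, using the pandas day numbering:&lt;/p&gt;

```python
# Pandas numbering: Monday=0 .. Sunday=6, so the weekend is {5, 6}
WEEKEND_DAYS = {5, 6}

def is_weekend(day_of_week: int) -> bool:
    """True for Saturday (5) or Sunday (6)."""
    return day_of_week in WEEKEND_DAYS

print([is_weekend(d) for d in (4, 5, 6, 0)])
# → [False, True, True, False]
```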



&lt;p&gt;And maybe then things start getting really wild: maybe you want to add even more parameters! &lt;/p&gt;

&lt;p&gt;Let’s consider holidays. Now you may be asking “Why in the world would we do that?!”, but people in the US often have time off around holidays. Since this individual lives within the US, they likely have at least &lt;em&gt;some&lt;/em&gt; holidays off, whether on the day itself or on the observed federal holiday. Where there are days off, there could be a difference in energy usage. To help guide my analysis, I want to include the identification of holidays. To do this, I’m going to create another boolean column that identifies when a federal holiday occurs. &lt;/p&gt;

&lt;p&gt;To do this, I am going to use TimescaleDB’s &lt;code&gt;time_bucket()&lt;/code&gt; function. The &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=time-bucket-docs" rel="noopener noreferrer"&gt;&lt;code&gt;time_bucket()&lt;/code&gt;&lt;/a&gt; function is one of the functions I discussed in detail within my &lt;a href="https://blog.timescale.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/?utm_source=tds&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=last-blog-time-bucket#timebucket/" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;. Essentially, I need this function to make sure every time value within a single day gets accounted for. Without &lt;code&gt;time_bucket()&lt;/code&gt;, only the row whose timestamp falls exactly at midnight would match a holiday date. &lt;/p&gt;
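&lt;p&gt;The role &lt;code&gt;time_bucket('1 day', time)&lt;/code&gt; plays here can be sketched in plain Python, with &lt;code&gt;datetime.date()&lt;/code&gt; standing in for the one-day bucket (the sample readings and holiday set below are illustrative only): truncate each timestamp to its day, then test membership against the holiday set.&lt;/p&gt;

```python
from datetime import datetime, date

# Illustrative holiday set and readings
holidays = {date(2016, 11, 24), date(2016, 12, 25)}

timestamps = [
    datetime(2016, 11, 24, 0, 15),   # Thanksgiving 2016, 00:15
    datetime(2016, 11, 24, 13, 45),  # Thanksgiving 2016, 13:45
    datetime(2016, 11, 25, 9, 0),    # the day after
]

# Truncating to the day means every reading within a holiday matches,
# not just the row at exactly midnight
is_holiday = [ts.date() in holidays for ts in timestamps]
print(is_holiday)  # → [True, True, False]
```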

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
After I create a holiday table, I can use its data directly within my query. I also decided to use the non-&lt;code&gt;CASE&lt;/code&gt; syntax for this query; note that you can use either!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--create table for the holidays&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;holidays&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;--insert the holidays into table&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;holidays&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2016-11-11'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2016-11-24'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2016-12-24'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2016-12-25'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2016-12-26'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-01-01'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-01-02'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-01-16'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-02-20'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-05-29'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-07-04'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-09-04'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-10-9'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-11-10'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-11-23'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-11-24'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-12-24'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2017-12-25'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-01-01'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-01-15'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-02-19'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-05-28'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-07-4'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-09-03'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-10-8'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="c1"&gt;-- I can then select the data from the holidays table directly within my IN statement&lt;/span&gt;
&lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;holidays&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_holiday&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="n"&gt;ew&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;is_weekend&lt;/th&gt;
&lt;th&gt;is_holiday&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 01:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;holidays&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2016-11-11&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2016-11-24&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2016-12-24&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2016-12-25&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2016-12-26&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-01-01&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-01-02&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-01-16&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-02-20&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-05-29&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-07-04&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-09-04&lt;/span&gt;&lt;span 
class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-10-9&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-11-10&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-11-23&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-11-24&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-12-24&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2017-12-25&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-01-01&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-01-15&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-02-19&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-05-28&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-07-4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-09-03&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span 
class="s"&gt;2018-10-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_holiday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;holidays&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, I’m going to save this expanded table into another &lt;code&gt;VIEW&lt;/code&gt; so that I can call the data without writing out the query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--create another view with the data from our first round of cleaning&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;energy_view_exp&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;holidays&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_holiday&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view&lt;/span&gt; &lt;span class="n"&gt;ew&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You may be asking, “Why did you create these as boolean columns?” A fair question! I often want to use these columns for filtering during analysis, and in PostgreSQL, boolean columns make that filtering trivial. For example, say I want to take my table query so far and show only the data that falls on a weekend &lt;code&gt;AND&lt;/code&gt; on a holiday. I can do this simply by adding a &lt;code&gt;WHERE&lt;/code&gt; clause that names the two columns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--if you use binary columns, then you can filter with a simple WHERE statement&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view_exp&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;is_holiday&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;is_weekend&lt;/th&gt;
&lt;th&gt;is_holiday&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-12-24 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-12-24 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-12-24 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-12-24 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-12-24 01:00:00.000&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-12-24 01:15:00.000&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_weekend&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_holiday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Adding data to a hypertable
&lt;/h2&gt;

&lt;p&gt;Now that my new columns are ready and I know how I want the table structured, I can create a new hypertable and insert the cleaned data. In my own analysis of this data set, I might do all the cleaning up to this point &lt;em&gt;before&lt;/em&gt; evaluating the data, so that the evaluation step is more meaningful. Happily, you can use any of these techniques for general cleaning, whether before or after evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;usage&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;units&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="n"&gt;float4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;day_of_week&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;is_weekend&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;is_holiday&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;--command to create a hypertable&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;create_hypertable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'energy_usage'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'time'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view_exp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
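&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt; there is no hypertable equivalent on the pandas side, but the insert step itself can be sketched with &lt;code&gt;to_sql&lt;/code&gt;. This is a minimal sketch using an in-memory SQLite connection and a hypothetical two-row stand-in for the cleaned data; in practice you would hand &lt;code&gt;to_sql&lt;/code&gt; a connection to your PostgreSQL/TimescaleDB database instead, and the hypertable would still be created on the database side with &lt;code&gt;create_hypertable&lt;/code&gt;.&lt;/p&gt;

```python
import sqlite3

import pandas as pd

# Hypothetical two-row stand-in for the cleaned energy data
energy_df = pd.DataFrame({
    "type": ["Electric usage", "Electric usage"],
    "time": pd.to_datetime(["2016-10-22 00:00", "2016-10-22 00:15"]),
    "usage": [0.01, 0.01],
    "units": ["kWh", "kWh"],
    "cost": [0.00, 0.00],
    "day_of_week": [6, 6],
    "is_weekend": [True, True],
    "is_holiday": [False, False],
})

# An in-memory SQLite database stands in for PostgreSQL/TimescaleDB here
conn = sqlite3.connect(":memory:")
energy_df.to_sql("energy_usage", conn, if_exists="append", index=False)
```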




&lt;p&gt;Note that if you had data continually coming in, you could create a script within your database that applies these transformations automatically on import. That way, cleaned data is ready and waiting in your database, rather than being reprocessed in your scripts every time you want to perform analysis. &lt;/p&gt;

&lt;p&gt;We will discuss this in detail in my next post, so make sure to stay tuned if you want to know how to create scripts and keep data automatically updated!&lt;/p&gt;

&lt;h2&gt;
  
  
  Renaming values
&lt;/h2&gt;

&lt;p&gt;Another valuable data cleaning technique is renaming items or remapping categorical values. The importance of this skill is underscored by the &lt;a href="https://stackoverflow.com/questions/40427943/how-do-i-change-a-single-index-value-in-pandas-dataframe" rel="noopener noreferrer"&gt;popularity of this Python data analysis question on StackOverflow&lt;/a&gt;, which asks, “How do I change a single index value in a pandas dataframe?” Since PostgreSQL and TimescaleDB use relational table structures, renaming unique values is fairly simple. &lt;/p&gt;

&lt;p&gt;When renaming specific index values within a table, you can do this “on the fly” by using PostgreSQL’s &lt;code&gt;CASE&lt;/code&gt; statement within the &lt;code&gt;SELECT&lt;/code&gt; query. Let’s say I don’t like Sunday being represented by a 0 in the &lt;code&gt;day_of_week&lt;/code&gt; column, but would prefer it to be a 7. I can do this with the following query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="c1"&gt;-- you can use case to recode column values &lt;/span&gt;
&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; 
&lt;span class="k"&gt;END&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
A caveat: this code would make Monday = 7, because pandas numbers the days with Monday set to 0 and Sunday set to 6, while PostgreSQL’s &lt;code&gt;EXTRACT(DOW ...)&lt;/code&gt; starts the week at Sunday = 0. Still, this is how you would update a single value within a column. You likely would not want to perform this exact change; I just wanted to show the Python equivalent for reference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
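&lt;p&gt;If you do want the two numbering schemes to line up, one sketch (using arbitrary example dates) is to shift pandas’ Monday = 0 convention over to PostgreSQL’s Sunday = 0 convention:&lt;/p&gt;

```python
import pandas as pd

# Two sample timestamps: a Sunday and a Monday
times = pd.Series(pd.to_datetime(["2018-07-22", "2018-07-23"]))

# pandas convention: Monday=0 ... Sunday=6
pandas_dow = times.dt.dayofweek

# Shift to PostgreSQL's EXTRACT(DOW ...) convention: Sunday=0 ... Saturday=6
postgres_dow = (pandas_dow + 1) % 7
```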



&lt;p&gt;Now, what if I want to use the names of the days of the week instead of numeric values? For this example, I want to ditch the &lt;code&gt;CASE&lt;/code&gt; statement and create a mapping table. When you need to change many values, it will likely be more efficient to create a mapping table and join to it with the &lt;a href="https://www.postgresql.org/docs/14/queries-table-expressions.html" rel="noopener noreferrer"&gt;&lt;code&gt;JOIN&lt;/code&gt;&lt;/a&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--first I need to create the table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;day_of_week_mapping&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;day_of_week_int&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;day_of_week_name&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;--then I want to add data to my table&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;day_of_week_mapping&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Sunday'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Monday'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Tuesday'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Wednesday'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Thursday'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Friday'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Saturday'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;--then I can join this table to my cleaning table to remap the days of the week&lt;/span&gt;
&lt;span class="k"&gt;SElECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dowm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day_of_week_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt; &lt;span class="n"&gt;eu&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;day_of_week_mapping&lt;/span&gt; &lt;span class="n"&gt;dowm&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;dowm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day_of_week_int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eu&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day_of_week&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;th&gt;day_of_week_name&lt;/th&gt;
&lt;th&gt;is_weekend&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2018-07-22 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;Sunday&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2018-07-22 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;Sunday&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2018-07-22 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;Sunday&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2018-07-22 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;Sunday&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2018-02-11 23:00:00.000&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;Sunday&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python:&lt;/strong&gt;&lt;br&gt;
In this case, pandas has a similar mapping function, &lt;code&gt;map&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sunday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Monday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tuesday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Wednesday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Thursday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Friday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span 
class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Saturday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hopefully, one of these techniques will be useful for you as you approach data renaming!&lt;/p&gt;

&lt;p&gt;Additionally, remember that if you would like to change the name of a column in your table, it is truly as easy as &lt;code&gt;AS&lt;/code&gt; (I couldn’t not use such a ridiculous statement 😂). When you use the &lt;code&gt;SELECT&lt;/code&gt; statement, you can rename your columns like so:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;usage_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;time_stamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dollar_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_view_exp&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;usage_type&lt;/th&gt;
&lt;th&gt;time_stamp&lt;/th&gt;
&lt;th&gt;usage&lt;/th&gt;
&lt;th&gt;units&lt;/th&gt;
&lt;th&gt;dollar_amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:00:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:15:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:30:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electric usage&lt;/td&gt;
&lt;td&gt;2016-10-22 00:45:00.000&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;kWh&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
Comparatively, renaming columns in Python takes more ceremony: you build a rename dictionary and then reselect the columns you want. This is an area where SQL is not only faster but also more elegant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;usage_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time_stamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dollar_amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;usage_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time_stamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;units&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dollar_amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Fill in missing data
&lt;/h2&gt;

&lt;p&gt;Another common problem in the data cleaning process is missing data. The dataset we are using has no obviously missing data points; however, upon closer evaluation, we could well find missing hourly data caused by a power outage or some other phenomenon. This is where TimescaleDB’s gap-filling functions come in handy. Missing data can significantly degrade the accuracy and dependability of a model, but you can often navigate this problem by filling the gaps with reasonable estimates, and TimescaleDB has built-in functions to help you do exactly that. &lt;/p&gt;

&lt;p&gt;For example, let’s say that you are modeling the energy usage over individual days of the week and a handful of days have missing energy data due to a power outage or an issue with the sensor. We could remove the data, or try to fill in the missing values with reasonable estimations. For today, let’s assume that the model I want to use would benefit more from filling in the missing values. &lt;/p&gt;
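&lt;p&gt;To build some intuition for what linear interpolation will do before we reach for the database, here is a minimal, hypothetical Python sketch (standard library only, not TimescaleDB’s implementation) that estimates readings inside a gap from the known points on either side:&lt;/p&gt;

```python
# Minimal sketch of linear interpolation across a gap, standard library only.
# The readings and timestamps are hypothetical, mirroring a 7:45-11:30 outage.
from datetime import datetime, timedelta

def interpolate_gap(t0, v0, t1, v1, step):
    """Yield (time, value) pairs strictly inside the gap (t0, t1),
    spaced `step` apart, following a straight line from v0 to v1."""
    total = (t1 - t0).total_seconds()
    t = t0 + step
    while t < t1:
        frac = (t - t0).total_seconds() / total
        yield t, v0 + frac * (v1 - v0)
        t += step

# Known readings on either side of the missing stretch.
before = (datetime(2021, 1, 1, 7, 45), 0.2)
after = (datetime(2021, 1, 1, 11, 30), 0.04)

filled = list(interpolate_gap(*before, *after, timedelta(minutes=15)))
for t, v in filled:
    print(t.strftime("%H:%M"), round(v, 4))
```

&lt;p&gt;Each estimated point simply sits on the straight line between the last reading before the gap and the first reading after it, which is the idea behind TimescaleDB’s interpolation hyperfunction.&lt;/p&gt;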

&lt;p&gt;To illustrate, I created some sample data in a table called energy_data; it is missing both time and energy readings for the timestamps between 7:45 a.m. and 11:30 a.m.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;energy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 07:00:00.000&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 07:15:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 07:30:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 07:45:00.000&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 11:30:00.000&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 11:45:00.000&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 12:00:00.000&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 12:15:00.000&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 12:30:00.000&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 12:45:00.000&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 13:00:00.000&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I can use TimescaleDB’s &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=gapfilling-docs" rel="noopener noreferrer"&gt;gapfilling hyperfunctions&lt;/a&gt; to fill in these missing values. The &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/interpolate/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=interpolate-docs" rel="noopener noreferrer"&gt;&lt;code&gt;interpolate()&lt;/code&gt;&lt;/a&gt; function is another of TimescaleDB’s hyperfunctions; it generates data points along a linear approximation between the data points before and after the missing range. Alternatively, you could use the &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/locf/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=locf-docs" rel="noopener noreferrer"&gt;&lt;code&gt;locf()&lt;/code&gt;&lt;/a&gt; hyperfunction, which carries the last recorded value forward to fill in the gap (locf stands for “last observation carried forward”). Both of these functions must be used in conjunction with the &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=time-bucket-gapfilling-docs" rel="noopener noreferrer"&gt;&lt;code&gt;time_bucket_gapfill()&lt;/code&gt;&lt;/a&gt; function. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="c1"&gt;--here I specified that the data should increment by 15 mins&lt;/span&gt;
  &lt;span class="n"&gt;time_bucket_gapfill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'15 min'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;interpolate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
  &lt;span class="n"&gt;locf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_data&lt;/span&gt;
&lt;span class="c1"&gt;--to use gapfill, you will have to take out any time data associated with null values. You can do this using the IS NOT NULL statement&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;energy&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2021-01-01 07:00:00.000'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2021-01-01 13:00:00.000'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;timestamp&lt;/th&gt;
&lt;th&gt;interpolate&lt;/th&gt;
&lt;th&gt;locf&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 07:00:00.000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0.10000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 07:30:00.000&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 08:00:00.000&lt;/td&gt;
&lt;td&gt;0.13625&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 08:30:00.000&lt;/td&gt;
&lt;td&gt;0.1225&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 09:00:00.000&lt;/td&gt;
&lt;td&gt;0.10875&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 09:30:00.000&lt;/td&gt;
&lt;td&gt;0.095&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 10:00:00.000&lt;/td&gt;
&lt;td&gt;0.08125&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 10:30:00.000&lt;/td&gt;
&lt;td&gt;0.0675&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 11:00:00.000&lt;/td&gt;
&lt;td&gt;0.05375&lt;/td&gt;
&lt;td&gt;0.15000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 11:30:00.000&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;0.04000000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 12:00:00.000&lt;/td&gt;
&lt;td&gt;0.025&lt;/td&gt;
&lt;td&gt;0.02500000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021-01-01 12:30:00.000&lt;/td&gt;
&lt;td&gt;0.025&lt;/td&gt;
&lt;td&gt;0.02500000000000000000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Python code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;energy_test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;energy_test_df_locf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;energy_test_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;15 min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ffill&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;energy_test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;energy_test_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;15 min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;interpolate&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;energy_test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;locf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;energy_test_df_locf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;energy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energy_test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
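&lt;p&gt;The snippet above assumes &lt;code&gt;energy_test_df&lt;/code&gt; has already been loaded from the database. If you want to try the same idea end to end, here is a self-contained sketch (with hypothetical values) that builds a tiny frame containing a gap and applies both linear interpolation and forward-fill, the pandas analogue of &lt;code&gt;locf()&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical 15-minute readings with a gap between 07:45 and 11:30.
df = pd.DataFrame({
    "time": pd.to_datetime(
        ["2021-01-01 07:30", "2021-01-01 07:45", "2021-01-01 11:30"]
    ),
    "energy": [0.1, 0.2, 0.04],
})

base = df.set_index("time")
interp = base.resample("15min").interpolate()  # straight line across the gap
locf = base.resample("15min").ffill()          # carry last value forward
print(interp.join(locf, rsuffix="_locf"))
```

&lt;p&gt;Resampling to a 15-minute grid inserts NaN rows for the missing timestamps, and the two fill strategies then estimate them the same way the TimescaleDB hyperfunctions do.&lt;/p&gt;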



&lt;p&gt;&lt;strong&gt;Bonus:&lt;/strong&gt;&lt;br&gt;
The following queries show how I could simply ignore the missing data instead. I wanted to include this to show just how easy it is to exclude null data. Alternatively, I could use a &lt;code&gt;WHERE&lt;/code&gt; clause on the time column to specify the ranges I would like to ignore (the second query).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_data&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;energy&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_data&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="s1"&gt;'2021-01-01 07:45:00.000'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2021-01-01 11:30:00.000'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Wrap Up
&lt;/h1&gt;

&lt;p&gt;After reading through these various cleaning techniques, I hope you feel more comfortable with exploring some of the possibilities that PostgreSQL and TimescaleDB provide. By cleaning data directly within my database, I am able to perform a lot of my cleaning tasks a single time rather than repetitively within a script, thus saving me time in the long run. If you are looking to save time and effort while cleaning your data for analysis, definitely consider using PostgreSQL and TimescaleDB. &lt;/p&gt;

&lt;p&gt;In my next posts, I will go over techniques on how to transform data using PostgreSQL and TimescaleDB. I'll then take everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).&lt;/p&gt;

&lt;p&gt;If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our &lt;a href="https://slack.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=slack" rel="noopener noreferrer"&gt;community Slack&lt;/a&gt;, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).&lt;/p&gt;

&lt;p&gt;If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can &lt;a href="https://www.timescale.com/timescale-signup/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-signup" rel="noopener noreferrer"&gt;sign up for a free 30-day trial&lt;/a&gt; or &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-self-hosted" rel="noopener noreferrer"&gt;install TimescaleDB and manage it on your current PostgreSQL instances&lt;/a&gt;. We also have a bunch of &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-tutorials" rel="noopener noreferrer"&gt;great tutorials&lt;/a&gt; to help get you started.&lt;/p&gt;

&lt;p&gt;Until next time!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functionality Glossary:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding columns together&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TO_NUMBER()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;VIEW&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WHERE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;EXTRACT()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CASE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time_bucket()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;JOIN&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AS&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE TABLE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;create_hypertable()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;INSERT INTO&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time_bucket_gapfill()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>database</category>
      <category>analytics</category>
      <category>postgres</category>
    </item>
    <item>
      <title>PostgreSQL vs Python for data evaluation: what, why, and how</title>
      <dc:creator>mirandaauhl</dc:creator>
      <pubDate>Thu, 07 Oct 2021 20:15:11 +0000</pubDate>
      <link>https://dev.to/tigerdata/postgresql-vs-python-for-data-evaluation-what-why-and-how-1e3j</link>
      <guid>https://dev.to/tigerdata/postgresql-vs-python-for-data-evaluation-what-why-and-how-1e3j</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;SQL basics&lt;/li&gt;
&lt;li&gt;A quick note on the data&lt;/li&gt;
&lt;li&gt;Evaluating the data&lt;/li&gt;
&lt;li&gt;Wrap up&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As I started writing this post, I realized that to properly show how to evaluate, clean, and transform data in the database (also known as data munging), I needed to focus on each step individually. This blog post will show you exactly how to use TimescaleDB and PostgreSQL to perform your &lt;strong&gt;data evaluation&lt;/strong&gt; tasks that you may have previously done in Excel, R, or Python. TimescaleDB and PostgreSQL cannot replace these tools entirely, but they can help your data munging/evaluation tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.  &lt;/p&gt;

&lt;p&gt;You may be asking yourself, “What exactly do you mean by evaluating the data?” When I talk about evaluating the data, I mean &lt;em&gt;really&lt;/em&gt; understanding the data set you are working with. &lt;/p&gt;

&lt;p&gt;If - in a theoretical world - I could grab a beer with my data set and talk to it about everything, that is what I would do during the evaluating step of my data analysis process. Before beginning analysis, I want to know every column, every general trend, every connection between tables, etc. To do this, I have to sit down and run query after query to get a solid picture of my data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recap
&lt;/h3&gt;

&lt;p&gt;If you remember, &lt;a href="https://blog.timescale.com/blog/speeding-up-data-analysis/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=previous-blog-post" rel="noopener noreferrer"&gt;in my last post&lt;/a&gt;, I summarized the analysis process as the “data analysis lifecycle” with the following steps: Evaluate, Clean, Transform, and Model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrl4941fhg2mr8zdwyox.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrl4941fhg2mr8zdwyox.jpeg" alt=" Image showing Evaluate -&amp;gt; Clean -&amp;gt; Transform -&amp;gt; Model, accompanied by icons which relate to each step" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a data analyst, I found that all the tasks I performed could be grouped into these four categories, with evaluating the data as the first and, I feel, most crucial step in the process. &lt;/p&gt;

&lt;p&gt;My first job out of college was at an energy and sustainability solutions company that focused on monitoring all different kinds of usage - such as electricity, water, sewage, you name it - to figure out how buildings could be more efficient. They would place sensors on whatever medium you wanted to monitor to help you figure out what initiatives your group could take to be more sustainable and ultimately save costs. My role at this company was to perform data analysis and business intelligence tasks.&lt;/p&gt;

&lt;p&gt;Throughout my time in this job, I got the chance to use many popular tools to evaluate my data, including Excel, R, Python, and heck, even Minitab. But once I tried using a database - and specifically PostgreSQL and TimescaleDB - I realized how efficient and straightforward evaluation work could be when done directly in a database. Operations that took me a while to hunt down online and figure out how to accomplish with pandas could be written intuitively in SQL. Plus, the database queries were just as fast as, if not faster than, my other code most of the time. &lt;/p&gt;

&lt;p&gt;Now, while I would love to show you a one-to-one comparison of my SQL code against each of these popular tools, that’s not practical. Besides, no one wants to read three examples of the same thing in a row! Thus, for comparison purposes in this blog post, I will directly show TimescaleDB and PostgreSQL functionality against Python code. Keep in mind that almost all code will likely be comparable to your Excel and R code. However, if you have any questions, feel free to hop on and join our &lt;a href="https://slack.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=slack" rel="noopener noreferrer"&gt;Slack channel&lt;/a&gt;, where you can ask the Timescale community, or me, specifics on TimescaleDB or PostgreSQL functionality 😊. I’d love to hear from you!&lt;/p&gt;

&lt;p&gt;Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away! Which is awesome! If so, you can &lt;a href="https://www.timescale.com/timescale-signup/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-sign-up" rel="noopener noreferrer"&gt;sign up for a free 30-day trial&lt;/a&gt; or &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=self-hosted" rel="noopener noreferrer"&gt;install and manage TimescaleDB on your own PostgreSQL instances&lt;/a&gt;. (You can also learn more by &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=tutorials" rel="noopener noreferrer"&gt;following one of our many tutorials&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;But enough of an intro, let’s get into the good stuff!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvhh8lgdxkcsf1dvvivz.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvhh8lgdxkcsf1dvvivz.gif" alt="Schitts Creek gif: Let's do it" width="480" height="480"&gt;&lt;/a&gt; &lt;/p&gt;
&lt;center&gt;&lt;a href="https://giphy.com/gifs/cbc-schitts-creek-QvwCVnX9DWdlHCnix5" rel="noopener noreferrer"&gt;via GIPHY&lt;/a&gt;&lt;/center&gt;
&lt;h2&gt;
  
  
  SQL basics
&lt;/h2&gt;

&lt;p&gt;PostgreSQL is a database platform that uses SQL syntax to interact with the data inside it. TimescaleDB is an extension that is applied to a PostgreSQL database. To unlock the potential of PostgreSQL and TimescaleDB, you have to use SQL. So, before we jump into things, I wanted to give a basic SQL syntax refresher. If you are familiar with SQL, please feel free to skip this section!&lt;/p&gt;

&lt;p&gt;For those of you who are newer to SQL (short for structured query language), it is the language many relational databases, including PostgreSQL, use to query data. Like &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html" rel="noopener noreferrer"&gt;pandas’ DataFrames&lt;/a&gt; or Excel’s spreadsheets, data queried with SQL is structured as a table with columns and rows.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.postgresql.org/docs/current/sql-select.html" rel="noopener noreferrer"&gt;basics of a SQL SELECT command&lt;/a&gt; can be broken down like this 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="c1"&gt;--columns, functions, aggregates, expressions that describe what you want to be shown in the results&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="c1"&gt;--if selecting data from a table in your DB, you must define the table name here&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="c1"&gt;--join another table to the FROM statement table &lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="c1"&gt;--a column that each table shares values &lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="c1"&gt;--statement to filter results where a column or expression is equivalent to some statement&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="c1"&gt;--if SELECT or WHERE statement contains an aggregate, or if you want to group values on a column/expression, must include columns here&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="c1"&gt;--similar to WHERE, this keyword helps to filter results based upon columns or expressions specifically used with a GROUP BY query&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="c1"&gt;--allows you to specify the order in which your data is displayed&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="c1"&gt;--lets you specify the number of rows you want displayed in the output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can think of your queries as &lt;em&gt;SELECTing&lt;/em&gt; data &lt;em&gt;FROM&lt;/em&gt; your tables within your database. You can &lt;em&gt;JOIN&lt;/em&gt; multiple tables together and specify &lt;em&gt;WHERE&lt;/em&gt; your data needs to be filtered or what it should be &lt;em&gt;GROUPed BY&lt;/em&gt;. Do you see what I did there 😋?&lt;/p&gt;

&lt;p&gt;This is the beauty of SQL; these keywords’ names were chosen to make your queries intuitive. Thankfully, most PostgreSQL and SQL functionality follows this same easy-to-read pattern. I have had the opportunity to teach myself many programming languages throughout my career, and SQL is by far the easiest to read, write, and construct. This intuitive nature is another excellent reason why data munging in PostgreSQL and TimescaleDB can be so efficient compared to other methods. &lt;/p&gt;

&lt;p&gt;Note that this list of keywords includes most of the ones you will need to start selecting data with SQL; however, it is not exhaustive. You will not need to use all these phrases for every query but likely will need at least &lt;code&gt;SELECT&lt;/code&gt; and &lt;code&gt;FROM&lt;/code&gt;. The queries in this blog post will always include these two keywords.&lt;/p&gt;

&lt;p&gt;Additionally, the order of these keywords is specific. When building your queries, you need to follow the order that I used above. For any additional PostgreSQL commands you wish to use, you will have to research where they fit in the order hierarchy and follow that accordingly. &lt;/p&gt;

&lt;p&gt;Seeing a list of commands may be somewhat helpful but is likely not enough to solidify understanding if you are like me. So let’s look at some examples!&lt;/p&gt;

&lt;p&gt;Let’s say that I have a table in my PostgreSQL database called &lt;code&gt;energy_usage&lt;/code&gt;. This table contains three columns: &lt;code&gt;time&lt;/code&gt;, which contains timestamp values; &lt;code&gt;energy&lt;/code&gt;, which contains numeric values; and &lt;code&gt;notes&lt;/code&gt;, which contains string values. As you may be able to imagine, every row of data in my &lt;code&gt;energy_usage&lt;/code&gt; table will contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;time&lt;/code&gt;: timestamp value saying when the reading was collected&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;energy&lt;/code&gt;: numeric value representing how much energy was used since the last reading&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;notes&lt;/code&gt;: string value giving additional context to each reading. &lt;/li&gt;
&lt;/ul&gt;
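&lt;p&gt;If you want to experiment with these queries without a PostgreSQL instance handy, the following sketch uses Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module as a stand-in (SQLite, not PostgreSQL, and the rows are made up) just to make the &lt;code&gt;energy_usage&lt;/code&gt; example concrete:&lt;/p&gt;

```python
import sqlite3

# A SQLite stand-in (not PostgreSQL); the schema mirrors the energy_usage
# table described above, and the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE energy_usage (time TEXT, energy REAL, notes TEXT)")
conn.executemany(
    "INSERT INTO energy_usage VALUES (?, ?, ?)",
    [
        ("2021-01-01 00:15:00", 0.02, "baseline"),
        ("2021-01-01 00:00:00", 0.01, "baseline"),
        ("2021-01-01 00:30:00", 0.03, "spike"),
    ],
)

# SELECT every column, earliest readings first.
rows = conn.execute(
    "SELECT time, energy, notes FROM energy_usage ORDER BY time ASC"
).fetchall()
for row in rows:
    print(row)
```

&lt;p&gt;The basic &lt;code&gt;SELECT&lt;/code&gt; statements in this section run unchanged against this toy table.&lt;/p&gt;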

&lt;p&gt;If I wanted to look at all the data within the table, I could use the following SQL query&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="c1"&gt;--I list my columns here&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="c1"&gt;-- I list my table here and end query with semi-colon&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, SQL has a shorthand for ‘include all columns’, the operator &lt;code&gt;*&lt;/code&gt;. So I could select all the data using this query as well,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What if I want to select the data and order it by the &lt;code&gt;time&lt;/code&gt; column so that the earliest readings are first and the latest are last? All I need to do is include the &lt;code&gt;ORDER BY&lt;/code&gt; statement and then specify the &lt;code&gt;time&lt;/code&gt; column along with the specification &lt;code&gt;ASC&lt;/code&gt; to let the database know I want the data in ascending order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;energy_usage&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="c1"&gt;-- first I list my time column then I specify either DESC or ASC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hopefully, you can start to see the pattern and feel more comfortable with SQL syntax. I will show many more code snippets throughout the post, so hang tight if you still need more examples!&lt;br&gt;
Now that we have had a little refresher on SQL basics, let’s jump into how you can use this language, along with TimescaleDB and PostgreSQL functionality, to evaluate your data!&lt;/p&gt;
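&lt;p&gt;If you want to experiment with these basics before connecting to a PostgreSQL instance, the same style of queries runs against SQLite from Python. This is a minimal sketch with a made-up &lt;code&gt;energy_usage&lt;/code&gt; table, purely for illustration:&lt;/p&gt;

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (illustration only);
# the energy_usage table and its rows are made up to mirror the columns above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE energy_usage (time TEXT, energy REAL, notes TEXT)")
cur.executemany(
    "INSERT INTO energy_usage VALUES (?, ?, ?)",
    [
        ("2016-01-06 02:00:00", 1.0, "weekday"),
        ("2016-01-06 01:00:00", 1.0, "weekday"),
        ("2016-01-06 03:00:00", 0.5, "weekday"),
    ],
)

# SELECT with an explicit column list
rows = cur.execute("SELECT time, energy, notes FROM energy_usage").fetchall()

# SELECT * with ORDER BY ... ASC puts the earliest readings first
ordered = cur.execute("SELECT * FROM energy_usage ORDER BY time ASC").fetchall()
conn.close()
```

&lt;p&gt;Because ISO-formatted timestamps sort lexicographically, &lt;code&gt;ORDER BY time ASC&lt;/code&gt; returns the 01:00:00 reading first.&lt;/p&gt;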
&lt;h2&gt;
  
  
  A quick note on the data
&lt;/h2&gt;

&lt;p&gt;Earlier I talked about my first job as a data analyst for an IoT sustainability company. Because of this job, I tend to love IoT data sets and couldn’t pass up the chance to explore &lt;a href="https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries" rel="noopener noreferrer"&gt;this IoT dataset from Kaggle&lt;/a&gt; to show how to perform data munging tasks in PostgreSQL and TimescaleDB. &lt;/p&gt;

&lt;p&gt;The data set contains two tables, one specifying energy consumption for a single home in Houston, Texas (called &lt;code&gt;power_usage&lt;/code&gt;), and the other documenting weather conditions (called &lt;code&gt;weather&lt;/code&gt;). This data is actually the same data set that I used in my previous post, so bonus points if you caught that 😊!&lt;/p&gt;

&lt;p&gt;This data was recorded from January 2016 to December 2020. While looking at this data set - as with any time-series data set - we must consider outside influences that could affect the data. The most obvious factor impacting analysis of this dataset is the COVID-19 pandemic, which overlaps with the data collected in 2020. Thankfully, we will see that the individual recording this data included notes to help categorize days affected by the pandemic. As I go through this blog series, we will see patterns associated with the data collected during the COVID-19 pandemic, so definitely keep this fact in the back of your mind as we perform various data munging and analysis steps!&lt;/p&gt;

&lt;p&gt;Here is an image explaining the two tables, their column names in red and corresponding data types in blue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F313g4nalb4viaenwccq9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F313g4nalb4viaenwccq9.jpg" alt="explanation of the the power_usage table and weather table. The power table has four columns: startdate (timestamp), value_kWh (numeric), day_of_week (int),  notes (varchar). Weather has date (date), day (int), temp_max (numeric), temp_avg (numeric), temp_min (numeric), dew_max (numeric), dew_avg (numeric), dew_min (numeric), hum_max   (numeric), hum_avg  (numeric), hum_min  (numeric), wind_max (numeric), wind_avg (numeric), wind_min (numeric), press_max (numeric), press_avg (numeric), press_min (numeric), precipit (numeric), day_of_week (int)" width="800" height="795"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we work through this blog post, we will use the evaluating techniques available within PostgreSQL and TimescaleDB to understand these two tables inside and out.&lt;/p&gt;
&lt;h2&gt;
  
  
  Evaluating the data
&lt;/h2&gt;

&lt;p&gt;As we discussed before, the first step in the data analysis lifecycle - and arguably the most critical step -  is to evaluate the data. I will go through how I would approach evaluating this IoT energy data, showing most of the techniques I have used in the past while working in data science. While these examples are not exhaustive, they will cover many of the evaluating steps you perform during your analysis, helping to make your evaluating tasks more efficient by using PostgreSQL and TimescaleDB. &lt;/p&gt;

&lt;p&gt;The techniques that I will cover include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading the raw data&lt;/li&gt;
&lt;li&gt;Finding and observing “categorical” column values in my dataset&lt;/li&gt;
&lt;li&gt;Sorting my data by specific columns&lt;/li&gt;
&lt;li&gt;Displaying grouped data&lt;/li&gt;
&lt;li&gt;Finding abnormalities in the database&lt;/li&gt;
&lt;li&gt;Looking at general trends&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Reading the raw data
&lt;/h3&gt;

&lt;p&gt;Let’s start with the simplest evaluation task: looking at the raw data.&lt;/p&gt;

&lt;p&gt;As we learned in the SQL refresher above, we can quickly pull all the data within a table by using the &lt;code&gt;SELECT&lt;/code&gt; statement with the &lt;code&gt;*&lt;/code&gt; operator. Since I have two tables within my database, I will query both tables’ data by running a query for each.&lt;/p&gt;

&lt;p&gt;PostgreSQL code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- select all the data from my power_usage table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;span class="c1"&gt;-- selects all the data from my weather table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But what if I don’t necessarily need to query all my data? Since all the data is housed in the database, if I want to get a feel for the data and the column values, I could just look at a snapshot of the raw data. &lt;/p&gt;

&lt;p&gt;While conducting analysis in Python, I often would just print a handful of rows of data to get a feel for the values. We can do this in PostgreSQL by including the &lt;code&gt;LIMIT&lt;/code&gt; command within our query. To show the first 20 rows of data in my tables, I can do the following:&lt;/p&gt;

&lt;p&gt;PostgreSQL code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- select all the data from my power_usage table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- specify 20 because I only want to see 20 rows of data&lt;/span&gt;
&lt;span class="c1"&gt;-- selects all the data from my weather table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; 
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results: Some of the rows for each table&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;startdate&lt;/th&gt;
&lt;th&gt;value_kwh&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 01:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 02:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 03:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 04:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 05:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 06:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;day&lt;/th&gt;
&lt;th&gt;temp_max&lt;/th&gt;
&lt;th&gt;temp_avg&lt;/th&gt;
&lt;th&gt;temp_min&lt;/th&gt;
&lt;th&gt;dew_max&lt;/th&gt;
&lt;th&gt;dew_avg&lt;/th&gt;
&lt;th&gt;dew_min&lt;/th&gt;
&lt;th&gt;hum_max&lt;/th&gt;
&lt;th&gt;hum_avg&lt;/th&gt;
&lt;th&gt;hum_min&lt;/th&gt;
&lt;th&gt;wind_max&lt;/th&gt;
&lt;th&gt;wind_avg&lt;/th&gt;
&lt;th&gt;wind_min&lt;/th&gt;
&lt;th&gt;press_max&lt;/th&gt;
&lt;th&gt;press_avg&lt;/th&gt;
&lt;th&gt;press_min&lt;/th&gt;
&lt;th&gt;precipit&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-06&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-07&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-08&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-09&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-10&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;td&gt;58&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-03-06&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python code:&lt;/p&gt;

&lt;p&gt;In this first Python code snippet, I show the modules I needed to import and the connection code that I would have to run to access the data from my database and import it into a pandas DataFrame. &lt;/p&gt;

&lt;p&gt;One of the challenges I faced while data munging in Python was the need to run through the entire script again and again when evaluating, cleaning, and transforming the data. This initial data-pulling process usually takes a good bit of time, so running it repeatedly was often frustrating. I also had to call print anytime I wanted to quickly glance at an array, DataFrame, or element. These extra tasks in Python can be time-consuming, especially if you end up at the modeling stage of the analysis lifecycle with only a subset of the original data! All this to say: for the other code snippets within this blog, I will not include this connection code, but it still runs in the background. &lt;/p&gt;

&lt;p&gt;Additionally, because I have my data housed in a TimescaleDB instance, I still need to use the &lt;code&gt;SELECT&lt;/code&gt; statement to query the data from the database and read it into Python. If you use a relational database - which I explained is very beneficial to analysis in my previous post - you will have to use &lt;em&gt;some&lt;/em&gt; SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tempfile&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;## use config file for database connection information
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConfigParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;env.ini&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## establish conntection
&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DB_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HOST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USER&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PASS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PORT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="c1"&gt;## define the queries for copying data out of our database (using format to copy queries)                    
&lt;/span&gt;&lt;span class="n"&gt;query_weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;select * from weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;query_power&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;select * from power_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;## define function to copy the data to a csv
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;copy_from_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tempfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TemporaryFile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tmpfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;copy_sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;COPY ({query}) TO STDOUT WITH CSV {head}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HEADER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy_expert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;copy_sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmpfile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tmpfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmpfile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;span class="c1"&gt;## create cursor to use in function above and place data into a file
&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;weather_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;copy_from_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;power_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;copy_from_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_power&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
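&lt;p&gt;As a side note, for smaller result sets pandas can collapse the query-and-read step into a single call with &lt;code&gt;pandas.read_sql&lt;/code&gt;. Here is a minimal sketch; I demonstrate it against an in-memory SQLite database with made-up &lt;code&gt;power_usage&lt;/code&gt; rows purely for illustration (with TimescaleDB you would pass the &lt;code&gt;psycopg2&lt;/code&gt; connection instead):&lt;/p&gt;

```python
import sqlite3

import pandas as pd

# Stand-in database; with TimescaleDB, pass the psycopg2 connection instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE power_usage (startdate TEXT, value_kwh REAL)")
conn.executemany(
    "INSERT INTO power_usage VALUES (?, ?)",
    [("2016-01-06 01:00:00", 1.0), ("2016-01-06 02:00:00", 1.0)],
)

# One call: run the query and load the result into a DataFrame
power_df = pd.read_sql("SELECT * FROM power_usage", conn)
conn.close()
```

&lt;p&gt;The trade-off is that &lt;code&gt;read_sql&lt;/code&gt; fetches rows through the driver and is typically slower than the &lt;code&gt;COPY&lt;/code&gt;-based approach above for large tables.&lt;/p&gt;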



&lt;h3&gt;
  
  
  Finding and observing “categorical” column values in my dataset
&lt;/h3&gt;

&lt;p&gt;Next, I think it is essential to understand any “categorical” columns - columns with a finite set of values - that I might have. This is useful in analysis because categorical data can give insight into natural groupings that often occur within a dataset. For example, I would assume that energy usage for many people is different on a weekday vs. a weekend. We can’t verify this without knowing the categorical possibilities and seeing how each could impact the data trend. &lt;/p&gt;

&lt;p&gt;First, I want to look at my tables and the data types used for each column. Looking at the available columns in each table, I can make an educated guess that the &lt;code&gt;day_of_week&lt;/code&gt;, &lt;code&gt;notes&lt;/code&gt;, and &lt;code&gt;day&lt;/code&gt; columns will be categorical. Let’s find out if they indeed are and how many different values exist in each. &lt;/p&gt;

&lt;p&gt;To find all the distinct values within a column (or between multiple columns), you can use the &lt;code&gt;DISTINCT&lt;/code&gt; keyword after &lt;code&gt;SELECT&lt;/code&gt; in your query statement. This can be useful for several data munging tasks, such as identifying categories - which I need to do - or finding unique sets of data. &lt;/p&gt;

&lt;p&gt;Since I want to look at the unique values within each column individually, I will run a query for each separately. If I were to run a query like this 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I would get data like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;vacation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;vacation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;vacation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;vacation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The output data would show unique &lt;em&gt;pairs&lt;/em&gt; of &lt;code&gt;day_of_week&lt;/code&gt; and &lt;code&gt;notes&lt;/code&gt; values within the table. This is why I query each column in its own statement: I want each individual column’s unique values, not the unique combinations of values. &lt;/p&gt;
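&lt;p&gt;To make the difference concrete, here is a small sketch - again against an in-memory SQLite database with made-up rows, purely for illustration - showing that &lt;code&gt;DISTINCT&lt;/code&gt; over two columns returns unique pairs, while &lt;code&gt;DISTINCT&lt;/code&gt; over one column returns only that column’s unique values:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE power_usage (day_of_week INTEGER, notes TEXT)")
cur.executemany(
    "INSERT INTO power_usage VALUES (?, ?)",
    [(3, "vacation"), (3, "weekday"), (1, "weekday"), (1, "weekday")],
)

# DISTINCT across two columns: unique (day_of_week, notes) pairs
pairs = cur.execute(
    "SELECT DISTINCT day_of_week, notes FROM power_usage"
).fetchall()

# DISTINCT on a single column: just that column's unique values
days = cur.execute(
    "SELECT DISTINCT day_of_week FROM power_usage ORDER BY day_of_week ASC"
).fetchall()
conn.close()
```

&lt;p&gt;Four rows produce three distinct pairs but only two distinct &lt;code&gt;day_of_week&lt;/code&gt; values, which is exactly why I query each column separately.&lt;/p&gt;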

&lt;p&gt;For these queries, I am also going to include the &lt;code&gt;ORDER BY&lt;/code&gt; command to show the values of each column in ascending order.&lt;/p&gt;

&lt;p&gt;PostgreSQL code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- selecting distinct values in the ‘day_of_week’ column within my power_usage table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- selecting distinct values in the ‘notes’ column within my power_usage table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- selecting distinct values in the ‘day’ column within my weather table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="nv"&gt;"day"&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nv"&gt;"day"&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- selecting distinct values in the ‘day_of_week’ column within my weather table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;p&gt;Notice that the person recording this data included “COVID_lockdown” as a category in the &lt;code&gt;notes&lt;/code&gt; column. As mentioned above, this note could be essential for finding and understanding patterns in this family's energy usage.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;COVID_lockdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vacation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(Only some of the values are shown for &lt;code&gt;day&lt;/code&gt;)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python code:&lt;br&gt;
In my Python code, notice that I need to print anything that I want to quickly observe. I have found this to be the quickest solution, even when compared to using the Python console in debug mode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;p_day_of_the_week&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;p_notes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;w_day&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;w_day_of_the_week&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_day_of_the_week&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_notes&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w_day&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w_day_of_the_week&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sorting my data by specific columns &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;What if I want to evaluate my tables based on how specific columns were sorted? One of the top questions asked on StackOverflow for Python data analysis is &lt;a href="https://stackoverflow.com/questions/17141558/how-to-sort-a-dataframe-in-python-pandas-by-two-or-more-columns" rel="noopener noreferrer"&gt;"How to sort a dataframe in python pandas by two or more columns?"&lt;/a&gt;. Once again, we can do this intuitively through SQL.&lt;/p&gt;

&lt;p&gt;One of the things I'm interested in identifying is how bad weather impacts energy usage. To do this, I have to think about indicators that typically signal bad weather, which include high precipitation, high wind speed, and low pressure. To identify days with this pattern in my PostgreSQL &lt;code&gt;weather&lt;/code&gt; table, I need to use the &lt;code&gt;ORDER BY&lt;/code&gt; keyword, then call out each column in the order I want things sorted, specifying the &lt;code&gt;DESC&lt;/code&gt; and &lt;code&gt;ASC&lt;/code&gt; attributes as needed. &lt;/p&gt;

&lt;p&gt;PostgreSQL code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- sort weather data by precipitation desc first, wind_avg desc second, and pressure asc third&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nv"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;precipit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wind_avg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;press_avg&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;precipit&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wind_avg&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;press_avg&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;precipit&lt;/th&gt;
&lt;th&gt;wind_avg&lt;/th&gt;
&lt;th&gt;press_avg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2017-08-27&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-08-28&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-09-20&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-08-08&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-08-29&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-08-12&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-06&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-05-07&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-10-05&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-03-29&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-03-06&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-06-19&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-08-05&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-10-30&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python code:&lt;/p&gt;

&lt;p&gt;I have often found the different pandas or Python functions to be harder to know off the top of my head. With how popular the StackOverflow question is, I can imagine that many of you also had to refer to Google for how to do this initially.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sorted_weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precipit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wind_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;press_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precipit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wind_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;press_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted_weather&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
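&lt;p&gt;One caveat worth knowing if your weather data has gaps: pandas places &lt;code&gt;NaN&lt;/code&gt; rows last by default, regardless of sort direction, while PostgreSQL treats &lt;code&gt;NULL&lt;/code&gt;s as largest by default. Here is a minimal sketch using the &lt;code&gt;na_position&lt;/code&gt; parameter of &lt;code&gt;sort_values()&lt;/code&gt;; the sample rows are hypothetical stand-ins for the real &lt;code&gt;weather_df&lt;/code&gt;.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real weather_df
weather_df = pd.DataFrame({
    'date': ['2017-08-27', '2017-08-28', '2019-09-20'],
    'precipit': [13.0, np.nan, 9.0],
    'wind_avg': [15, 24, 9],
    'press_avg': [30, 30, 30],
})

# pandas puts NaN rows last by default regardless of sort direction;
# na_position='first' surfaces them instead, so missing readings are easy to spot
sorted_weather = weather_df.sort_values(
    ['precipit', 'wind_avg', 'press_avg'],
    ascending=[False, False, True],
    na_position='first',
)
print(sorted_weather)
```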



&lt;h3&gt;
  
  
  Displaying grouped data &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Finding the sum of energy usage from data that records energy per hour can be instrumental in understanding data patterns. This concept boils down to performing a type of aggregation over a particular column. Between PostgreSQL and TimescaleDB, we have access to almost every type of aggregation function we could need. I will show some of these functions in this blog series, but I strongly encourage all of you to &lt;a href="https://www.postgresql.org/docs/current/functions-aggregate.html" rel="noopener noreferrer"&gt;look up more&lt;/a&gt; for your own use!&lt;/p&gt;

&lt;p&gt;From the categorical section earlier, I mentioned that I suspect people could have different energy behavior patterns on weekdays vs. weekends, particularly in a single-family home in the US. Given my data set, I’m curious about this hypothesis and want to find the cumulative energy consumption across each day of the week. &lt;/p&gt;

&lt;p&gt;To do so, I need to sum all the kWh data (&lt;code&gt;value_kwh&lt;/code&gt;) in the power table, then group this data by the day of the week (&lt;code&gt;day_of_week&lt;/code&gt;). In order to sum my data in PostgreSQL, I will use the &lt;code&gt;SUM()&lt;/code&gt; function. Because this is an aggregation function, I will have to include something that tells the database what to sum over. Since I want to know the sum of energy over each type of day, I can specify that the sum should be grouped by the &lt;code&gt;day_of_week&lt;/code&gt; column using the &lt;code&gt;GROUP BY&lt;/code&gt; keyword. I also added the &lt;code&gt;ORDER BY&lt;/code&gt; keyword so that we could look at the weekly summed usage in order of the day. &lt;/p&gt;

&lt;p&gt;PostgreSQL code: &lt;a&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- first I select the day_of_week col, then I define SUM(value_kwn) to get the sum of value_kwh col&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;--sum the value_sum column&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; &lt;span class="c1"&gt;-- group by the day_of_week col&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- decided to order data by the day_of_week asc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;p&gt;After some quick investigation, I found that the value &lt;code&gt;0&lt;/code&gt; in the &lt;code&gt;day_of_week&lt;/code&gt; column represents a Monday, so my hypothesis may just be right. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;sum&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3849&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3959&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3947&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4094&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3987&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;4169&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;4311&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python code:&lt;/p&gt;

&lt;p&gt;Something to note about the pandas &lt;code&gt;groupby()&lt;/code&gt; function is that the group by column in the DataFrame will become the index column in the resulting aggregated DataFrame. This can add some extra work later on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;day_agg_power&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value_kwh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sum&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;day_agg_power&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
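&lt;p&gt;To avoid the extra work mentioned above, you can keep the grouping key as an ordinary column instead of letting it become the index. A minimal sketch with hypothetical sample data standing in for the real &lt;code&gt;power_df&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample standing in for the real power_df
power_df = pd.DataFrame({
    'day_of_week': [0, 0, 1, 1],
    'value_kwh':   [2, 3, 4, 5],
})

# Option 1: as_index=False keeps day_of_week as a regular column from the start
day_agg_power = power_df.groupby('day_of_week', as_index=False).agg({'value_kwh': 'sum'})

# Option 2: reset_index() converts the index back into a column afterwards
day_agg_power_alt = power_df.groupby('day_of_week').agg({'value_kwh': 'sum'}).reset_index()

print(day_agg_power)
```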



&lt;h3&gt;
  
  
  Finding abnormalities in the database &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Clean data is fundamental in producing accurate analysis, and abnormalities/errors can be a huge roadblock to clean data. An essential part of evaluating data is finding abnormalities to determine if an error caused them. No data set is perfect, so it is vital to hunt down any possible errors in preparation for the cleaning stage of our analysis. Let's look at one example of how to uncover issues in a dataset using our example energy data.&lt;/p&gt;

&lt;p&gt;After looking at the raw data in my &lt;code&gt;power_usage&lt;/code&gt; table, I found that the &lt;code&gt;notes&lt;/code&gt; and &lt;code&gt;day_of_week&lt;/code&gt; columns &lt;strong&gt;should be the same for each hour across a single day&lt;/strong&gt; (there are 24 hourly readings each day, and each hour is supposed to have the same &lt;code&gt;notes&lt;/code&gt; value). In my experience with data analysis, I have found that notes which need to be recorded granularly often have mistakes within them. Because of this, I wanted to investigate whether or not this pattern was consistent across all of the data.&lt;/p&gt;

&lt;p&gt;To check this hypothesis I can use the TimescaleDB &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=time-bucket#time-bucket/" rel="noopener noreferrer"&gt;&lt;code&gt;time_bucket()&lt;/code&gt;&lt;/a&gt; function, PostgreSQL’s &lt;code&gt;GROUP BY&lt;/code&gt; keyword, and &lt;a href="https://www.postgresql.org/docs/13/queries-with.html" rel="noopener noreferrer"&gt;CTEs&lt;/a&gt; (common table expressions). While the &lt;code&gt;GROUP BY&lt;/code&gt; keyword is likely familiar to you by now, CTEs and the &lt;code&gt;time_bucket()&lt;/code&gt; function are not. So, before I show the query, let’s dive into these two features.&lt;/p&gt;

&lt;h4&gt;
  
  
  Time bucket function
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;time_bucket()&lt;/code&gt; function allows you to take a timestamp column like &lt;code&gt;startdate&lt;/code&gt; in the &lt;code&gt;power_usage&lt;/code&gt; table, and “bucket” the time based on the interval of your choice. For example, &lt;code&gt;startdate&lt;/code&gt; is a timestamp column that shows values for each hour in a day. You could use the &lt;code&gt;time_bucket()&lt;/code&gt; function on this column to “bucket” the hourly data into daily data. &lt;/p&gt;

&lt;p&gt;Here is an image that shows how rows of the &lt;code&gt;startdate&lt;/code&gt; column are bucketed into one aggregate row with &lt;code&gt;time_bucket(‘1 day’, startdate)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1466njhe3eg504kacrus.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1466njhe3eg504kacrus.jpg" alt="Image showing how hourly data from 2016-01-01 00:00:00 - 2016-01-01 23:00:00 is bucketed to 2016-01-01 00:00:00 using the time_bucket() function " width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After using the &lt;code&gt;time_bucket()&lt;/code&gt; function in my query, I will have one unique “date” value for any data recorded over a single day. Since &lt;code&gt;notes&lt;/code&gt; and &lt;code&gt;day_of_week&lt;/code&gt; should also be unique over each day, if I &lt;em&gt;group by&lt;/em&gt; these columns, I should get a single set of (date, day_of_week, notes) values. &lt;/p&gt;

&lt;p&gt;Notice that to use &lt;code&gt;GROUP BY&lt;/code&gt; in this scenario, I just list the columns I want to group on. Also, notice that I added &lt;code&gt;AS&lt;/code&gt; after my &lt;code&gt;time_bucket()&lt;/code&gt; function; this keyword lets you "rename" columns. In the results, look for the &lt;code&gt;day&lt;/code&gt; column, as this comes directly from my rename. &lt;/p&gt;

&lt;p&gt;PostgreSQL code: &lt;a&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- select the date through time_bucket and get unique values for each &lt;/span&gt;
&lt;span class="c1"&gt;-- (date, day_of_week, notes) set&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startdate&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results: Some of the rows&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2017-01-19 00:00:00&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-10-06 00:00:00&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-06-04 00:00:00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-01-03 00:00:00&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-10-01 00:00:00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-11-27 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-06-15 00:00:00&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-11-16 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017-05-18 00:00:00&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-07-17 00:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020-03-06 00:00:00&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-10-14 00:00:00&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python code:&lt;/p&gt;

&lt;p&gt;In my Python code, I cannot simply manipulate the table to print results; I actually have to create another column in the DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;day_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;startdate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;power_unique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_unique&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
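&lt;p&gt;If you would rather keep the bucketed column as a real timestamp, which is closer to what &lt;code&gt;time_bucket()&lt;/code&gt; returns than the formatted string above, pandas offers &lt;code&gt;dt.floor('D')&lt;/code&gt;. A sketch with hypothetical rows standing in for the real &lt;code&gt;power_df&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample standing in for the real power_df
power_df = pd.DataFrame({
    'startdate': ['2016-01-01 00:00:00', '2016-01-01 13:00:00', '2016-01-02 05:00:00'],
    'day_of_week': [4, 4, 5],
    'notes': ['weekday', 'weekday', 'weekend'],
})

# dt.floor('D') truncates each timestamp to midnight while keeping datetime dtype,
# so later date arithmetic and joins still work without re-parsing strings
power_df['date_day'] = pd.to_datetime(power_df['startdate']).dt.floor('D')
power_unique = power_df[['date_day', 'day_of_week', 'notes']].drop_duplicates()
print(power_unique)
```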



&lt;p&gt;Now that we understand the &lt;code&gt;time_bucket()&lt;/code&gt; function a little better, let's look at CTEs and how they help me use this bucketed data to find any errors within the &lt;code&gt;notes&lt;/code&gt; column. &lt;/p&gt;

&lt;h4&gt;
  
  
  CTEs or common table expressions
&lt;/h4&gt;

&lt;p&gt;Getting unique sets of data only solves half of my problem. Now I want to verify that each day is truly mapped to a single &lt;code&gt;day_of_week&lt;/code&gt; and &lt;code&gt;notes&lt;/code&gt; pair. This is where CTEs come in handy. With CTEs, you can build a query based on the results of others. &lt;/p&gt;

&lt;p&gt;CTEs use the following format 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;query_1&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="c1"&gt;-- columns expressions&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="c1"&gt;--column expressions &lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;query_1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;WITH&lt;/code&gt; and &lt;code&gt;AS&lt;/code&gt; allow you to define the first query, then in the second &lt;code&gt;SELECT&lt;/code&gt; statement, you can call the results from the first query as if it were another table in the database. &lt;/p&gt;

&lt;p&gt;To check that each day was “mapped” to a single &lt;code&gt;day_of_week&lt;/code&gt; and &lt;code&gt;notes&lt;/code&gt; pair, I need to aggregate the queried &lt;code&gt;time_bucket()&lt;/code&gt; table above on the date column using another PostgreSQL aggregation function, &lt;code&gt;COUNT()&lt;/code&gt;. I am doing this because each day should contain only one unique &lt;code&gt;day_of_week&lt;/code&gt; and &lt;code&gt;notes&lt;/code&gt; pair. If the count is two or more, that day contains multiple &lt;code&gt;day_of_week&lt;/code&gt; and &lt;code&gt;notes&lt;/code&gt; pairs and is thus showing abnormal data. &lt;/p&gt;

&lt;p&gt;Additionally, I will add a &lt;code&gt;HAVING&lt;/code&gt; statement into my query so that the output only displays rows where the &lt;code&gt;COUNT(day)&lt;/code&gt; is greater than one. I will also throw in an &lt;code&gt;ORDER BY&lt;/code&gt; statement in case we have many different values greater than 1.&lt;/p&gt;

&lt;p&gt;PostgreSQL code: &lt;a&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;power_unique&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;-- query from above, get unique set of (date, day_of_week, notes)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startdate&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- calls data from the query above, using the COUNT() agg function&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_unique&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2017-12-27 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020-01-03 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018-06-02 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019-06-03 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020-07-01 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-07-21 00:00:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python code:&lt;/p&gt;

&lt;p&gt;Because of the count aggregation, I needed to rename the column in my &lt;code&gt;agg_power_unique&lt;/code&gt; DataFrame so that I could then sort the values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;day_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;startdate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;## If you ran the previous code snippet, this next line will error since you already ran it
&lt;/span&gt;&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;power_unique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;agg_power_unique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_unique&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;agg_power_unique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agg_power_unique&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agg_power_unique&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;agg_power_unique&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
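If you prefer a shorter pandas route, `value_counts()` collapses the rename-then-filter steps into one call. This is only a sketch: the tiny DataFrame below is hypothetical, standing in for `power_df` with the same three columns.

```python
import pandas as pd

# hypothetical sample standing in for power_df; the real table has one
# row per reading, so a duplicated date can hide behind distinct rows
power_df = pd.DataFrame({
    'date_day':    ['2019-06-03', '2019-06-03', '2020-07-01',
                    '2020-07-01', '2016-07-22'],
    'day_of_week': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
    'notes':       ['weekday'] * 5,
})

# drop duplicate (date, weekday, notes) rows first, then count the distinct
# rows left per date; value_counts() already sorts descending by count
counts = (power_df[['date_day', 'day_of_week', 'notes']]
          .drop_duplicates()['date_day']
          .value_counts())
print(counts[counts > 1])
```

The filter `counts > 1` surfaces exactly the dates that map to more than one distinct weekday/notes combination.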



&lt;p&gt;This query reveals that I indeed have several data points that seem suspicious: specifically, the dates [2017-12-27, 2020-01-03, 2018-06-02, 2019-06-03, 2020-07-01, 2016-07-21]. I will demonstrate how to fix these date issues in a later blog post about cleaning techniques.&lt;/p&gt;

&lt;p&gt;This example shows only one set of functions that helped me identify abnormal data through grouping and aggregation. You can use many other PostgreSQL and TimescaleDB functions to find other abnormalities in your data, such as TimescaleDB’s &lt;code&gt;approx_percentile()&lt;/code&gt; function (introduced next) to find outliers in numeric columns through interquartile range calculations.&lt;/p&gt;
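As a sketch of that interquartile-range idea in plain numpy terms (the kWh values below are hypothetical, and this uses the standard 1.5 × IQR fence rather than anything TimescaleDB-specific):

```python
import numpy as np

# hypothetical daily kWh sums; 73.0 is an obvious outlier
daily_kwh = np.array([4.0, 7.0, 16.0, 18.9, 22.0, 29.0, 39.0, 73.0])

# classic IQR fence: flag anything beyond 1.5 * IQR from the quartiles
q75, q25 = np.percentile(daily_kwh, [75, 25])
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
outliers = daily_kwh[(daily_kwh < lower) | (daily_kwh > upper)]
print(outliers)
```

The same fence can be expressed in SQL by substituting `approx_percentile(0.75, ...)` and `approx_percentile(0.25, ...)` into a `WHERE` clause.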

&lt;h4&gt;
  
  
  Looking at general trends
&lt;/h4&gt;

&lt;p&gt;Arguably, one of the more critical aspects of evaluating your data is understanding its general trends. To do this, you need basic statistics on your data, using functions like mean, interquartile range, and maximum value. Timescale has created many optimized hyperfunctions to perform these very tasks.&lt;/p&gt;

&lt;p&gt;To calculate these values, I am going to introduce the following TimescaleDB functions: &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=approx-percentile" rel="noopener noreferrer"&gt;&lt;code&gt;approx_percentile&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/min_val/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=min-val" rel="noopener noreferrer"&gt;&lt;code&gt;min_val&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/max_val/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=max-val" rel="noopener noreferrer"&gt;&lt;code&gt;max_val&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/mean/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=mean" rel="noopener noreferrer"&gt;&lt;code&gt;mean&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/num_vals/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=num-vals" rel="noopener noreferrer"&gt;&lt;code&gt;num_vals&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile_agg/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=percentile-agg" rel="noopener noreferrer"&gt;&lt;code&gt;percentile_agg&lt;/code&gt; (aggregate)&lt;/a&gt;, and &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/tdigest/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=tdigest" rel="noopener noreferrer"&gt;&lt;code&gt;tdigest&lt;/code&gt; (aggregate)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These hyperfunctions fall under TimescaleDB’s two-step aggregation pattern. Timescale designed each function to be either an aggregate or an accessor function (I noted above which ones are aggregates). In two-step aggregation, the more computationally taxing aggregate function is calculated first, and the accessor function is then applied to its result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feymhbffv4yst6sjv8vuf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feymhbffv4yst6sjv8vuf.png" alt="accessor_function(aggregate_function())" width="800" height="61"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For specifics on how two-step aggregation works and why we use this convention, check out &lt;a href="https://blog.timescale.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=hyperfunctions-blog" rel="noopener noreferrer"&gt;David Kohn’s blog series on our hyperfunctions and two-step aggregation&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;I definitely want to understand the basic trends within the &lt;code&gt;power_usage&lt;/code&gt; table for my data set. If I plan to do any type of modeling to predict future usage trends, I need to know some basic information about what this home’s usage looks like daily. &lt;/p&gt;

&lt;p&gt;To understand the daily power usage data distribution, I’ll need to aggregate the energy usage per day. To do this, I can use the &lt;code&gt;time_bucket()&lt;/code&gt; function I mentioned above, along with the &lt;code&gt;SUM()&lt;/code&gt; operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- bucket the daily data using time_bucket, sum kWh over each bucketed day&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startdate&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I then want to find the 1st, 10th, 25th, 75th, 90th, and 99th percentiles, the median (50th percentile), mean, minimum value, maximum value, number of readings in the table, and interquartile range of this data. Creating the query with a CTE simplifies the process by calculating the daily sums only once and reusing them multiple times.&lt;/p&gt;

&lt;p&gt;PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;power_usage_sum&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;-- bucket the daily data using time_bucket, sum kWh over each bucketed day&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startdate&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;sum_kwh&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- using two-step aggregation functions to find stats&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"1p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"10p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"25p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"50p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"75p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"90p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"99p"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;min_val&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tdigest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="n"&gt;max_val&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tdigest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="n"&gt;num_vals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="c1"&gt;-- you can use subtraction to create an output for the IQR&lt;/span&gt;
&lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;iqr&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage_sum&lt;/span&gt; &lt;span class="n"&gt;pus&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;1p&lt;/th&gt;
&lt;th&gt;10p&lt;/th&gt;
&lt;th&gt;25p&lt;/th&gt;
&lt;th&gt;50p&lt;/th&gt;
&lt;th&gt;75p&lt;/th&gt;
&lt;th&gt;90p&lt;/th&gt;
&lt;th&gt;99p&lt;/th&gt;
&lt;th&gt;min_val&lt;/th&gt;
&lt;th&gt;max_val&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;num_vals&lt;/th&gt;
&lt;th&gt;iqr&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;4.0028&lt;/td&gt;
&lt;td&gt;6.9936&lt;/td&gt;
&lt;td&gt;16.0066&lt;/td&gt;
&lt;td&gt;28.9914&lt;/td&gt;
&lt;td&gt;38.9781&lt;/td&gt;
&lt;td&gt;56.9971&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;73.0&lt;/td&gt;
&lt;td&gt;18.9025&lt;/td&gt;
&lt;td&gt;1498.0&lt;/td&gt;
&lt;td&gt;21.9978&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python:&lt;/p&gt;

&lt;p&gt;Something that really stumped me when initially writing this code snippet was that I had to use &lt;code&gt;astype(float)&lt;/code&gt; on my &lt;code&gt;value_kwh&lt;/code&gt; column before I could use &lt;code&gt;describe()&lt;/code&gt;. I have probably spent the combined time of a day over my life dealing with value types being incompatible with certain functions. This is another reason why I enjoy data munging with the intuitive functionality of PostgreSQL and TimescaleDB: these types of problems just happen less often. And let me tell you, the faster and more painless the data munging is, the happier I am!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agg_power&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value_kwh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sum&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# need to make the value_kwh column the right data type
&lt;/span&gt;&lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;describe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;percentiles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;([.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;q75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q25&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value_kwh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;iqr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q75&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;q25&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iqr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another technique you may want to use for assessing the distribution of data in a column is a histogram. Generally, creating an image is where Python and other tools shine. However, I often need to glance at a histogram to check for any blatant anomalies when evaluating data. While this technique in TimescaleDB may not be as simple as the Python solution, I can still do it directly in my database, which can be convenient. &lt;/p&gt;

&lt;p&gt;To create a histogram in the database, we will need to use the TimescaleDB &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/histogram/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=histogram#required-arguments/" rel="noopener noreferrer"&gt;&lt;code&gt;histogram()&lt;/code&gt;&lt;/a&gt; function, &lt;a href="https://www.postgresql.org/docs/13/functions-array.html" rel="noopener noreferrer"&gt;&lt;code&gt;unnest()&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://www.postgresql.org/docs/13/functions-srf.html" rel="noopener noreferrer"&gt;&lt;code&gt;generate_series()&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://www.postgresql.org/docs/13/functions-string.html" rel="noopener noreferrer"&gt;&lt;code&gt;repeat()&lt;/code&gt;&lt;/a&gt;, and CTEs.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;histogram()&lt;/code&gt; function takes in the column you want to analyze and produces an array containing the frequency values for the specified number of buckets plus two (one extra bucket for values below the lowest bound and one for values above the highest). You can then use PostgreSQL’s &lt;code&gt;unnest()&lt;/code&gt; function to break the array into a single column, with one row per bucket. &lt;/p&gt;

&lt;p&gt;Once you have a column with bucket frequencies, you can then create a histogram “image” using the PostgreSQL &lt;code&gt;repeat()&lt;/code&gt; function. The first time I saw someone use the &lt;code&gt;repeat()&lt;/code&gt; function in this way was in &lt;a href="https://hakibenita.com/sql-for-data-analysis" rel="noopener noreferrer"&gt;Haki Benita’s blog post&lt;/a&gt;, which I recommend reading if you are interested in learning more PostgreSQL analytical techniques. The &lt;code&gt;repeat()&lt;/code&gt; function builds a string that repeats a chosen character a specified number of times; to visualize the histogram, you simply pass each unnested frequency value in as the repeat count. &lt;/p&gt;
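For comparison, the same text-histogram trick in Python is just string multiplication: `numpy.histogram` stands in for the `histogram()` hyperfunction and `'■' * n` stands in for `repeat()`. The values here are hypothetical, and note that numpy, unlike TimescaleDB, does not add the two out-of-range buckets.

```python
import numpy as np

# hypothetical daily kWh sums standing in for the bucketed query output
daily_kwh = np.array([2.0, 5.0, 5.5, 6.0, 11.0, 15.0, 15.5, 22.0, 28.0, 40.0])

# np.histogram returns (counts, bin_edges); values outside `range`
# would simply be dropped rather than collected in extra buckets
counts, edges = np.histogram(daily_kwh, bins=5, range=(0, 50))

# string multiplication plays the role of repeat('■', n)
for start, n in zip(edges[:-1], counts):
    print(f'{start:6.1f} | {"■" * n}')
```

This prints one row per bucket with its approximate starting value, much like the SQL result further down.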

&lt;p&gt;Additionally, I find it useful to know the approximate starting values for each bucket in the histogram. This gives me a better picture of which values are occurring where. To approximate the bin values, I use the PostgreSQL &lt;code&gt;generate_series()&lt;/code&gt; function along with some algebra:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;number_of_buckets&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;max_val&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;min_val&lt;/span&gt;&lt;span class="p"&gt;]::&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;number_of_buckets&lt;/span&gt;&lt;span class="p"&gt;]::&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;min_val&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
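To sanity-check that algebra, here is the same expression in Python, using the min value 0, max value 73, and 30 buckets from the histogram query; `generate_series(-1, n)` corresponds to `range(-1, n + 1)`.

```python
# mirror the SQL:
# (generate_series(-1, n) * (max - min)::float / n::float) + min
min_val, max_val, n_buckets = 0.0, 73.0, 30

starts = [i * (max_val - min_val) / n_buckets + min_val
          for i in range(-1, n_buckets + 1)]

# 32 entries in total: the first is the underflow bucket's start,
# the last is the overflow bucket's start
print(starts[0], starts[1], starts[2])
```

The first three values (about -2.433, 0.0, and 2.433) match the `approx_bucket_start_val` column in the results table below.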



&lt;p&gt;When I put all these techniques together, I can produce a histogram with the following query.&lt;/p&gt;

&lt;p&gt;PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;power_usage_sum&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;-- bucket the daily data using time_bucket, sum kWh over each bucketed day&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startdate&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;sum_kwh&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;histogram&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;-- I input the column = sum_kwh, the min value = 28, max value = 90, and number of buckets = 25&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_kwh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;power_usage_sum&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; 
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
&lt;span class="c1"&gt;-- I use unnest to create the first column&lt;/span&gt;
   &lt;span class="k"&gt;unnest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="c1"&gt;-- I use my approximate bucket values function&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;approx_bucket_start_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="c1"&gt;-- I then use the repeat function to display the frequency&lt;/span&gt;
   &lt;span class="n"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'■'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;unnest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;frequency&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;th&gt;approx_bucket_start_val&lt;/th&gt;
&lt;th&gt;frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;-2.433333333333333&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;td&gt;2.433333333333333&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;207&lt;/td&gt;
&lt;td&gt;4.866666666666666&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;7.3&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;td&gt;9.733333333333333&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;12.166666666666666&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;14.6&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;17.033333333333335&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;19.466666666666665&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;td&gt;21.9&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;24.333333333333332&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;26.766666666666666&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;29.2&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;31.633333333333333&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;td&gt;34.06666666666667&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;36.5&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;38.93333333333333&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;41.36666666666667&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;43.8&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;46.233333333333334&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;48.666666666666664&lt;/td&gt;
&lt;td&gt;■■■■■■■■■■■■■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;51.1&lt;/td&gt;
&lt;td&gt;■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;53.53333333333333&lt;/td&gt;
&lt;td&gt;■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;55.96666666666667&lt;/td&gt;
&lt;td&gt;■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;58.4&lt;/td&gt;
&lt;td&gt;■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;60.833333333333336&lt;/td&gt;
&lt;td&gt;■■■■■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;63.266666666666666&lt;/td&gt;
&lt;td&gt;■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;65.7&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;68.13333333333334&lt;/td&gt;
&lt;td&gt;■&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;70.56666666666666&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;73.0&lt;/td&gt;
&lt;td&gt;■&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python:&lt;/p&gt;

&lt;p&gt;This Python code is definitely simpler and relatively painless. I wanted to show this comparison to present the option of displaying a histogram directly in your database vs. pulling the data into a pandas DataFrame and then displaying it. Building the histogram in the database just helps me keep my focus while evaluating the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
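If you prefer to stay text-based on the Python side too, a closer analogue of the SQL text histogram above can be sketched with NumPy and pandas. This is only an illustration: the gamma-distributed sample stands in for the kWh readings, and the bucket count of 30 mirrors the SQL example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the value_kwh readings
values = pd.Series(np.random.default_rng(0).gamma(2.0, 5.0, 1000))

# Bin into 30 buckets, mirroring the 30-bucket histogram() call in SQL
counts, edges = np.histogram(values, bins=30)

# Print a text histogram like the repeat('■', ...) trick in PostgreSQL
for start, count in zip(edges[:-1], counts):
    print(f"{start:8.2f} | {'■' * count}")
```
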



&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;Hopefully, after reading through these various evaluation techniques, you feel more comfortable exploring some of the possibilities that PostgreSQL and TimescaleDB provide. Evaluating data directly in the database has often saved me time without sacrificing any functionality. If you are looking to save time and effort while evaluating your data for analysis, definitely consider using PostgreSQL and TimescaleDB. &lt;/p&gt;

&lt;p&gt;In my next posts, I will go over techniques to clean and transform data using PostgreSQL and TimescaleDB. I'll then take everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting deep-dive data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).&lt;/p&gt;

&lt;p&gt;If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our &lt;a href="https://slack.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=slack" rel="noopener noreferrer"&gt;community Slack&lt;/a&gt;, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).&lt;/p&gt;

&lt;p&gt;If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can &lt;a href="https://www.timescale.com/timescale-signup/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-sign-up" rel="noopener noreferrer"&gt;sign up for a free 30-day trial&lt;/a&gt; or &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=self-hosted" rel="noopener noreferrer"&gt;install TimescaleDB and manage it on your current PostgreSQL instances&lt;/a&gt;. We also have a bunch of &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=tutorials" rel="noopener noreferrer"&gt;great tutorials&lt;/a&gt; to help get you started.&lt;br&gt;
Until next time!&lt;/p&gt;

&lt;h2&gt;
  
  
  Functionality Glossary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SELECT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FROM&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ORDER BY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DESC&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ASC&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LIMIT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DISTINCT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GROUP BY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SUM()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time_bucket(&amp;lt;time_interval&amp;gt;, &amp;lt;time_col&amp;gt;)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;CTE’s &lt;code&gt;WITH&lt;/code&gt; &lt;code&gt;AS&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;COUNT()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;approx_percentile()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;min_val()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_val()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mean()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;num_vals()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;percentile_agg()&lt;/code&gt; [aggregate]&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tdigest()&lt;/code&gt; [aggregate]&lt;/li&gt;
&lt;li&gt;&lt;code&gt;histogram()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;unnest()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;generate_series()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;repeat()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>postgres</category>
      <category>datascience</category>
      <category>timescaledb</category>
      <category>timeseries</category>
    </item>
    <item>
      <title>Speeding up data analysis with TimescaleDB and PostgreSQL</title>
      <dc:creator>mirandaauhl</dc:creator>
      <pubDate>Fri, 17 Sep 2021 14:15:54 +0000</pubDate>
      <link>https://dev.to/tigerdata/speeding-up-data-analysis-with-timescaledb-and-postgresql-3nj</link>
      <guid>https://dev.to/tigerdata/speeding-up-data-analysis-with-timescaledb-and-postgresql-3nj</guid>
      <description>&lt;h1&gt;
  
  
  Table of contents
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Common data analysis tools and “the problem”&lt;/li&gt;
&lt;li&gt;Data analysis issue #1: storing and accessing data&lt;/li&gt;
&lt;li&gt;Data analysis issue #2: maximizing analysis speed and computation efficiency (the bigger the dataset, the bigger the problem)&lt;/li&gt;
&lt;li&gt;Data analysis issue #3: storing and maintaining scripts for data analysis&lt;/li&gt;
&lt;li&gt;Data analysis issue #4: easily utilizing new or additional technologies&lt;/li&gt;
&lt;li&gt;Wrapping up&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://blog.timescale.com/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=time-series-blog-post" rel="noopener noreferrer"&gt;Time-series&lt;/a&gt; data is everywhere, and it drives decision-making in every industry. Time-series data collectively represents how a system, process, or behavior changes over time. Understanding these changes helps us to solve complex problems across numerous industries, including &lt;a href="https://blog.timescale.com/blog/simplified-prometheus-monitoring-for-your-entire-organization-with-promscale/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=promscale-blog-post" rel="noopener noreferrer"&gt;observability&lt;/a&gt;, &lt;a href="https://blog.timescale.com/blog/how-messari-uses-data-to-open-the-cryptoeconomy-to-everyone/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=messari-blog-post" rel="noopener noreferrer"&gt;financial services&lt;/a&gt;, &lt;a href="https://blog.timescale.com/blog/how-meter-group-brings-a-data-driven-approach-to-the-cannabis-production-industry/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=meter-groups-cannabis-blog-post" rel="noopener noreferrer"&gt;Internet of Things&lt;/a&gt;, and even &lt;a href="https://blog.timescale.com/blog/hacking-nfl-data-with-postgresql-timescaledb-and-sql/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=nfl-blog-post" rel="noopener noreferrer"&gt;professional football&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Depending on the type of application they’re building, developers end up collecting millions of rows of time-series data (and sometimes millions of rows of data every day or even every hour!). Making sense of this high-volume, high-fidelity data takes a particular set of data analysis skills that aren’t often exercised as part of the classic developer skillset. To perform time-series analysis that goes beyond basic questions, developers and data analysts need specialized tools, and as &lt;a href="https://db-engines.com/en/ranking_categories" rel="noopener noreferrer"&gt;time-series data grows in prominence&lt;/a&gt;, the &lt;strong&gt;efficiency&lt;/strong&gt; of these tools becomes even more important.&lt;/p&gt;

&lt;p&gt;Often, data analysts’ work can be boiled down to &lt;strong&gt;evaluating&lt;/strong&gt;, &lt;strong&gt;cleaning&lt;/strong&gt;, &lt;strong&gt;transforming&lt;/strong&gt;, and &lt;strong&gt;modeling&lt;/strong&gt; data. In my experience, I’ve found these actions are necessary for me to gain understanding from data, and I will refer to this as the “data analysis life cycle” throughout this post.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjffib5zhl318233ti2y.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjffib5zhl318233ti2y.jpeg" alt="Graphic showing the “data analysis lifecycle”, Evaluate -&amp;gt; Clean -&amp;gt; Transform -&amp;gt; Model" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Excel, R, and Python are arguably some of the most commonly used data analysis tools, and, while they are all fantastic tools, they may not be suited for every job. Speaking from experience, these tools can be especially inefficient for “data munging” at the early stages of the lifecycle; specifically, the &lt;strong&gt;evaluating data&lt;/strong&gt;, &lt;strong&gt;cleaning data&lt;/strong&gt;, and &lt;strong&gt;transforming data&lt;/strong&gt; steps involved in pre-modeling work.&lt;/p&gt;

&lt;p&gt;As I’ve worked with larger and more complex datasets, I’ve come to believe that databases built for specific types of data - such as time-series data - are more effective for data analysis.&lt;/p&gt;

&lt;p&gt;For background, &lt;a href="https://www.timescale.com/products/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-products-page" rel="noopener noreferrer"&gt;TimescaleDB&lt;/a&gt; is a &lt;em&gt;relational&lt;/em&gt; database for time-series data. If your analysis is based on time-series datasets, TimescaleDB can be a great choice not only for its scalability and dependability but also for its relational nature. Because TimescaleDB is packaged as an extension to PostgreSQL, you’ll be able to look at your time-series data alongside your relational data and get even more insight. (I recognize that as a Developer Advocate at Timescale, I might be a &lt;em&gt;little&lt;/em&gt; biased 😊…)&lt;/p&gt;

&lt;p&gt;In this four-part blog series, I will discuss each of the three data munging steps in the analysis lifecycle in depth and demonstrate how to use TimescaleDB as a powerful tool for your data analysis.&lt;/p&gt;

&lt;p&gt;In this introductory post, I'll explore a few of the common frustrations that I experienced with popular data analysis tools, and from there, dive into how I’ve used TimescaleDB to help alleviate each of those pain points.&lt;/p&gt;

&lt;p&gt;In future posts we'll look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How TimescaleDB data analysis functionality can replace work commonly performed in Python and pandas&lt;/li&gt;
&lt;li&gt;How TimescaleDB vs. Python and pandas compare (benchmarking a standard data analysis workflow)&lt;/li&gt;
&lt;li&gt;How to use TimescaleDB to conduct an end-to-end, deep-dive data analysis, using real yellow taxi cab data from the &lt;a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page" rel="noopener noreferrer"&gt;New York City Taxi and Limousine Commission&lt;/a&gt; (NYC TLC).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are interested in trying out TimescaleDB and PostgreSQL functionality right away, &lt;a href="https://www.timescale.com/timescale-signup/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-signup" rel="noopener noreferrer"&gt;sign up for a free 30-day trial&lt;/a&gt; or &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-self-hosted-guide" rel="noopener noreferrer"&gt;install and manage it on your instances&lt;/a&gt;. (You can also learn more by &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-tutorials" rel="noopener noreferrer"&gt;following one of our many tutorials&lt;/a&gt;.)&lt;/p&gt;

&lt;h1&gt;
  
  
  Common data analysis tools and “the problem”
&lt;/h1&gt;

&lt;p&gt;As we’ve discussed, the three most popular tools used for data analysis are Excel, R, and Python. While they are great tools in their own right, they are not optimized to efficiently perform every step in the analysis process.&lt;/p&gt;

&lt;p&gt;In particular, most data scientists (including myself!) struggle with similar issues as the amount of data grows or the same analysis needs to be redone month after month.&lt;/p&gt;

&lt;p&gt;Some of these struggles include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data storage and access: Where is the best place to store and maintain my data for analysis?&lt;/li&gt;
&lt;li&gt;Data size and its influence on the analysis: How can I improve efficiency for data munging tasks, especially as data scales?&lt;/li&gt;
&lt;li&gt;Script storage and accessibility: What can I do to improve data munging script storage and maintenance?&lt;/li&gt;
&lt;li&gt;Easily utilizing new technologies: How could I set up my data analysis toolchain to allow for easy transitions to new technologies?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So buckle in, keep your arms and legs in the vehicle at all times, and let’s start looking at these problems!&lt;/p&gt;

&lt;h1&gt;
  
  
  Data analysis issue #1: storing and accessing data
&lt;/h1&gt;


&lt;p&gt;To do data analysis, you need access to… data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faugwnodx7y3h0ucy5rv2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faugwnodx7y3h0ucy5rv2.gif" alt="Image of Data from Star Trek smiling" width="480" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;
        &lt;a href="https://giphy.com/gifs/rIq6ASPIqo2k0" rel="noopener noreferrer"&gt;via GIPHY&lt;/a&gt;
&lt;/center&gt;

&lt;p&gt;Managing where that data lives and how easily you can access it is the preliminary (and often most important) step in the analysis journey. Every time I begin a new data analysis project, this is often where I run into my first dilemma. Regardless of the original data source, I always ask “where is the best place to store and maintain the data as I start working through the data munging process?”&lt;/p&gt;

&lt;p&gt;Although it's becoming more common for data analysts to use databases for storing and querying data, it's still not ubiquitous. Too often, raw data is provided in a stream of CSV files or APIs that produce JSON. While this may be manageable for smaller projects, it can quickly become overwhelming to maintain and difficult to manage from project to project.&lt;/p&gt;

&lt;p&gt;For example, let’s consider how we might use Python as our data analysis tool of choice.&lt;/p&gt;

&lt;p&gt;While using Python for data analysis, I have the option of ingesting data through files/APIs OR a database connection.&lt;/p&gt;

&lt;p&gt;If I used files or APIs for querying data during analysis, I often faced questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where are the files located? What happens if the URL or parameters change for an API?&lt;/li&gt;
&lt;li&gt;What happens if duplicate files are made? And what if updates are made to one file, and not the other?&lt;/li&gt;
&lt;li&gt;How do I best share these files with colleagues?&lt;/li&gt;
&lt;li&gt;What happens if multiple files depend on one another?&lt;/li&gt;
&lt;li&gt;How do I prevent incorrect data from being added to the wrong column of a CSV? (i.e., a decimal where a string should be)&lt;/li&gt;
&lt;li&gt;What about very large files? What is the ingestion rate for a 10 MB, 100 MB, 1 GB, or 1 TB file?&lt;/li&gt;
&lt;/ul&gt;
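To make the wrong-column concern concrete, here is a small sketch of how a stray string in a numeric CSV column surfaces in pandas; the file contents and column names are hypothetical, not from the example dataset.

```python
import io
import pandas as pd

# Hypothetical CSV where a string slipped into a numeric column
csv_data = io.StringIO("reading_id,value_kwh\n1,3.2\n2,oops\n3,4.7\n")

# Without dtype enforcement, pandas silently falls back to an object column
df = pd.read_csv(csv_data)
print(df["value_kwh"].dtype)  # object, not float64

# Coercing flags the bad cell as NaN so it can be caught and handled
df["value_kwh"] = pd.to_numeric(df["value_kwh"], errors="coerce")
print(df["value_kwh"].isna().sum())  # one bad row
```

A database would instead reject the bad value at insert time, which is part of the argument for a single typed source of truth.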

&lt;p&gt;After running into these initial problems project after project, &lt;strong&gt;I knew there had to be a better solution. I knew that I needed a single source of truth for my data – and it started to become clear that a specialized SQL database might be my answer!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s consider if I were to connect to TimescaleDB.&lt;/p&gt;

&lt;p&gt;By importing my time-series data into TimescaleDB, I can create one source of truth for all of my data. As a result, collaborating with others becomes as simple as sharing access to the database. Any modifications to the data munging process within the database mean that all users have access to the same changes at the same time, as opposed to parsing through CSV files to verify that I have the right version.&lt;/p&gt;

&lt;p&gt;Additionally, databases can typically handle much larger data loads than a script written in Python or R. TimescaleDB was built to house, maintain, and query terabytes of data efficiently and cost-effectively (both computationally speaking AND for your wallet). With features like &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=caggs-docs" rel="noopener noreferrer"&gt;continuous aggregates&lt;/a&gt; and native &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=compression-docs" rel="noopener noreferrer"&gt;columnar compression&lt;/a&gt;, storing and analyzing years of time-series data became efficient while still being easily accessible.&lt;/p&gt;

&lt;p&gt;In short, managing data over time, especially when it comes from different sources, can be a nightmare to maintain and access efficiently. But, it doesn’t have to be.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data analysis issue #2: maximizing analysis speed and computation efficiency (the bigger the dataset, the bigger the problem)
&lt;/h1&gt;


&lt;p&gt;Excel, R, and Python are all capable of performing the first three steps of the data analysis “lifecycle”: evaluating, cleaning, and transforming data. However, these technologies are not generally optimized for speed or computational efficiency during the process.&lt;/p&gt;

&lt;p&gt;In numerous projects over the years, I’ve found that as the size of my dataset increased, the process of importing, cleaning, and transforming it became more difficult, time-consuming, and, in some cases, impossible. For Python and R, parsing through large amounts of data seemed to take forever, and Excel would simply crash once it hit millions of rows.  &lt;/p&gt;

&lt;p&gt;Things became &lt;em&gt;especially&lt;/em&gt; difficult when I needed to create additional tables for things like aggregates or data transformations: some lines of code could take seconds or, in extreme cases, minutes to run depending on the size of the data, the computer I was using, or the complexity of the analysis.&lt;/p&gt;

&lt;p&gt;While seconds or minutes may not &lt;em&gt;seem&lt;/em&gt; like a lot, it adds up and amounts to hours or days of lost productivity when you’re performing analysis that needs to be run hundreds or thousands of times a month!&lt;/p&gt;

&lt;p&gt;To illustrate, let’s look at a Python example once again.&lt;/p&gt;

&lt;p&gt;Say I was working with this &lt;a href="https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries" rel="noopener noreferrer"&gt;IoT data set taken from Kaggle&lt;/a&gt;. The set contains two tables, one specifying energy consumption for a single home in Houston Texas, and the other documenting weather conditions.&lt;/p&gt;

&lt;p&gt;To run through analysis with Python, the first steps in my analysis would be to pull in the data and observe it.&lt;/p&gt;

&lt;p&gt;When using Python to do this, I would run code like this 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;


&lt;span class="c1"&gt;## use config file for database connection information
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConfigParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;env.ini&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## establish conntection
&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DB_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                        &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HOST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                        &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USER&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                        &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PASS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                        &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USERINFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PORT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;## define the queries for selecting data out of our database                        
&lt;/span&gt;&lt;span class="n"&gt;query_weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;select * from weather&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;query_power&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;select * from power_usage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;## create cursor to extract data and place it into a DataFrame
&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_weather&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_power&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;power_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;## you will have to manually set the column names for the data frame
&lt;/span&gt;&lt;span class="n"&gt;weather_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp_max&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp_min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dew_max&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dew_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dew_min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hum_max&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hum_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span 
class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hum_min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wind_max&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wind_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wind_min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;press_max&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;press_avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;press_min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precipit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;power_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;startdate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value_kwh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Altogether, this code took 2.718 seconds to run on my &lt;a href="https://www.apple.com/shop/buy-mac/macbook-pro/16-inch-space-gray-2.3ghz-8-core-processor-1tb#" rel="noopener noreferrer"&gt;2019 MacBook Pro with 32GB of memory&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But what if I run the equivalent script as SQL directly in the database?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;startdate&lt;/th&gt;
&lt;th&gt;value_kwh&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 01:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 02:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 03:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 04:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 05:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 06:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 07:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 08:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 09:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 10:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 11:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 12:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 13:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 14:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 15:00:00&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 16:00:00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 17:00:00&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This query took only 0.342 seconds to run, nearly 8x faster than the Python script.&lt;/p&gt;

&lt;p&gt;This time difference makes sense when we consider that Python must connect to the database, run the SQL query, parse the retrieved data, and import it into a DataFrame. While almost three seconds is fast, this extra processing time adds up as the script grows more complicated and more data munging tasks are added.&lt;/p&gt;
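&lt;p&gt;To make that overhead concrete, here is a minimal sketch of the last two steps, with a small hypothetical list of tuples standing in for what a psycopg2 cursor would return from the query:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical rows standing in for the tuples that cursor.fetchall()
# returns from the database query
power_data = [
    ("2016-01-06 01:00:00", 1, 2, "weekday"),
    ("2016-01-06 02:00:00", 1, 2, "weekday"),
    ("2016-01-06 05:00:00", 0, 2, "weekday"),
]

# Each step adds overhead on top of the query itself: the driver ships
# rows over the wire, Python materializes them as tuples, and pandas
# copies them again into a DataFrame (and re-parses the timestamps).
power_df = pd.DataFrame(
    power_data, columns=["startdate", "value_kwh", "day_of_week", "notes"]
)
power_df["startdate"] = pd.to_datetime(power_df["startdate"])

print(power_df.shape)  # (3, 4)
```

&lt;p&gt;None of this work exists when the query runs in the database: the rows are already typed and indexed where they live.&lt;/p&gt;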

&lt;p&gt;Pulling in the data and observing it is only the beginning of my analysis! What happens when I need to perform a transforming task, like aggregating the data?&lt;/p&gt;

&lt;p&gt;For this dataset, when we look at the &lt;code&gt;power_usage&lt;/code&gt; table - as seen above - kWh readings are recorded every hour. If I want to do daily analysis, I have to aggregate the hourly data into “day buckets”.  &lt;/p&gt;

&lt;p&gt;If I used Python for this aggregation, I could use something like 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# sum power usage by day, bucket by day
## create column for the day 
&lt;/span&gt;&lt;span class="n"&gt;day_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;startdate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agg_power&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;power_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value_kwh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sum&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day_of_week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unique&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unique&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agg_power&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...which takes 0.49 seconds to run (this does not include the time for importing our data).&lt;/p&gt;
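&lt;p&gt;If you want to reproduce timings like these yourself, a minimal sketch with &lt;code&gt;time.perf_counter()&lt;/code&gt; over synthetic hourly data (the exact numbers will vary by machine and data size):&lt;/p&gt;

```python
import time

import pandas as pd

# Synthetic stand-in for power_df: 1,000 hourly kWh readings
power_df = pd.DataFrame({
    "startdate": pd.date_range("2016-01-06", periods=1000, freq="h"),
    "value_kwh": range(1000),
})

# Time only the day-bucketing aggregation, not the data import
start = time.perf_counter()
day_col = power_df["startdate"].dt.strftime("%Y-%m-%d")
agg_power = (
    power_df.assign(date_day=day_col)
    .groupby("date_day")
    .agg({"value_kwh": "sum"})
)
elapsed = time.perf_counter() - start

print(len(agg_power))  # 42 day buckets
```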

&lt;p&gt;Alternatively, with the TimescaleDB &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=time-bucket-docs" rel="noopener noreferrer"&gt;&lt;code&gt;time_bucket()&lt;/code&gt;&lt;/a&gt; function, I could do this aggregation directly in the database using the following query 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startdate&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_kwh&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;power_usage&lt;/span&gt; &lt;span class="n"&gt;pu&lt;/span&gt; 
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_of_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;day&lt;/th&gt;
&lt;th&gt;sum&lt;/th&gt;
&lt;th&gt;day_of_week&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-06 00:00:00&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-07 00:00:00&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-08 00:00:00&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-09 00:00:00&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-10 00:00:00&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-11 00:00:00&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-01-12 00:00:00&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-06 00:00:00&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-07 00:00:00&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;weekend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-08 00:00:00&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-09 00:00:00&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016-02-10 00:00:00&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;weekday&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;...which only takes 0.087 seconds and is over 5x faster than the Python script.&lt;/p&gt;

&lt;p&gt;You can start to see a pattern here.&lt;/p&gt;

&lt;p&gt;As mentioned above, &lt;a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=why-use-timescale-docs#why-use-timescaledb/" rel="noopener noreferrer"&gt;TimescaleDB was created to efficiently query and store time-series data&lt;/a&gt;. But simply querying data only scratches the surface of the possibilities TimescaleDB and PostgreSQL functionality provides.&lt;/p&gt;

&lt;p&gt;TimescaleDB and PostgreSQL offer a wide range of tools and functionality that can replace the need for additional tools to evaluate, clean, and transform your data. Some of the TimescaleDB functionality includes continuous aggregates, compression, and &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=hyperfunctions-doc" rel="noopener noreferrer"&gt;hyperfunctions&lt;/a&gt;, all of which allow you to do nearly all data munging tasks directly within the database.&lt;/p&gt;

&lt;p&gt;When I performed the evaluating, cleaning, and transforming steps of my analysis directly within TimescaleDB, I cut out the need to use additional tools - like Excel, R, or Python - for data munging tasks. I could pull cleaned and transformed data, ready for modeling, directly into Excel, R, or Python.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data analysis issue #3: storing and maintaining scripts for data analysis
&lt;/h1&gt;


&lt;p&gt;Another potential downside of exclusively using Excel, R, or Python for the entire data analysis workflow is that all of the logic for analyzing the data is contained within a script file. As with having many different data sources, maintaining script files can be inconvenient and messy.&lt;/p&gt;

&lt;p&gt;Some common issues that I - and many data analysts - run into include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Losing files&lt;/li&gt;
&lt;li&gt;Unintentionally creating duplicate files&lt;/li&gt;
&lt;li&gt;Changing or updating some files but not others&lt;/li&gt;
&lt;li&gt;Needing to write and run scripts to access transformed data (see the example below)&lt;/li&gt;
&lt;li&gt;Spending time re-running scripts whenever new raw data is added (see the example below)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While you can use a code repository to overcome some of these issues, it will not fix the last two.  &lt;/p&gt;

&lt;p&gt;Let’s consider our Python scenario again.&lt;/p&gt;

&lt;p&gt;Say that I used a Python script exclusively for all my data analysis tasks. What happens if I need to export my transformed data to use in a report on energy consumption in Texas?&lt;/p&gt;

&lt;p&gt;Likely, I would have to add some code within the script to allow for exporting the data and then run the script again to actually export it. Depending on the content of the script and how long it takes to transform the data, this could be pretty inconvenient and inefficient.&lt;/p&gt;

&lt;p&gt;What if I also just got a bunch of new energy usage and weather data? For me to incorporate this new raw data into existing visualizations or reports, I would need to run the script again and make sure that all of my data munging tasks run as expected.&lt;/p&gt;

&lt;p&gt;Database functions, like continuous aggregates and materialized views, can create transformed data that can be stored and queried directly from your database without running a script. Additionally, I can create policies for continuous aggregates to regularly keep this transformed data up-to-date any time raw data is modified. Because of these policies, I wouldn't have to worry about running scripts to re-transform data for use, making access to updated data efficient. With TimescaleDB, many of the data munging tasks in the analysis lifecycle that you would normally do within your scripts can be accomplished using built-in TimescaleDB and PostgreSQL functionality.&lt;/p&gt;
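&lt;p&gt;For illustration, a continuous aggregate over the &lt;code&gt;power_usage&lt;/code&gt; table from earlier might look like the following sketch (the intervals and policy settings here are assumptions for demonstration, not recommendations):&lt;/p&gt;

```sql
-- A continuous aggregate that stores daily kWh totals for power_usage
CREATE MATERIALIZED VIEW power_daily
WITH (timescaledb.continuous) AS
SELECT
    time_bucket(INTERVAL '1 day', startdate) AS day,
    sum(value_kwh) AS total_kwh
FROM power_usage
GROUP BY time_bucket(INTERVAL '1 day', startdate);

-- A policy that keeps the recent buckets refreshed as raw data changes
SELECT add_continuous_aggregate_policy('power_daily',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```

&lt;p&gt;Once this is in place, querying &lt;code&gt;power_daily&lt;/code&gt; returns pre-aggregated results without re-running any script.&lt;/p&gt;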

&lt;h1&gt;
  
  
  Data analysis issue #4: easily utilizing new or additional technologies
&lt;/h1&gt;


&lt;p&gt;Finally, the last step in the data analysis lifecycle: modeling. Whenever I wanted to use a new tool or technology to create a visualization, it was difficult to take my transformed data and use it for modeling or visualizations elsewhere.&lt;/p&gt;

&lt;p&gt;Python, R, and Excel are all pretty great for their visualization and modeling capabilities. However, what happens when your company or team wants to adopt a new tool?&lt;/p&gt;

&lt;p&gt;In my experience, this often means either adding on another step to the analysis process, or rediscovering how to perform the evaluating, cleaning, and transforming steps within the new technology.&lt;/p&gt;

&lt;p&gt;For example, in one of my previous jobs, I was asked to convert a portion of my analysis into Power BI for business analytics purposes. Some of the visualizations my stakeholders wanted required access to transformed data from my Python script. At the time, I could either export the data from my Python script or figure out how to transform the data directly in Power BI. Neither option was ideal, and both were guaranteed to take extra time.&lt;/p&gt;

&lt;p&gt;When it comes to adopting new visualization or modeling tools, using a database for evaluating, cleaning, and transforming data can again work in your favor. Most visualization tools - such as &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;, &lt;a href="https://www.metabase.com/" rel="noopener noreferrer"&gt;Metabase&lt;/a&gt;, or &lt;a href="https://powerbi.microsoft.com/" rel="noopener noreferrer"&gt;Power BI&lt;/a&gt; - allow users to import data from a database directly.&lt;/p&gt;

&lt;p&gt;Since I can do most of my data munging tasks within TimescaleDB, adding or switching tools - such as using Power BI for dashboard capabilities - becomes as simple as connecting to my database, pulling in the munged data, and using the new tool for visualizations and modeling.&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrapping up
&lt;/h1&gt;


&lt;p&gt;In summary, Excel, R, and Python are all great tools to use for analysis, but may not be the best tools for every job. Case in point: my struggles with time-series data analysis, especially on big datasets.&lt;/p&gt;

&lt;p&gt;With TimescaleDB functionality, you can house your data and perform the evaluating, cleaning, and transforming aspects of data analysis, all directly within your database – and solve a lot of common data analysis woes in the process (which I’ve - hopefully! - demonstrated in this post).&lt;/p&gt;

&lt;p&gt;In the blog posts to come, I’ll explore TimescaleDB and PostgreSQL functionality compared to Python, benchmark TimescaleDB performance vs. Python and pandas for data munging tasks, and conduct a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).&lt;/p&gt;

&lt;p&gt;If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, &lt;a href="https://slack.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=slack" rel="noopener noreferrer"&gt;join our &lt;strong&gt;community Slack&lt;/strong&gt;&lt;/a&gt;, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).&lt;/p&gt;

&lt;p&gt;If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can sign up for &lt;a href="https://www.timescale.com/timescale-signup/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-signup" rel="noopener noreferrer"&gt;a free 30-day trial&lt;/a&gt; or &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-self-hosted-guide" rel="noopener noreferrer"&gt;install TimescaleDB&lt;/a&gt; and manage it on your current PostgreSQL instances. We also have a bunch of great &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=timescale-tutorials" rel="noopener noreferrer"&gt;tutorials&lt;/a&gt; to help get you started.&lt;/p&gt;

&lt;p&gt;Until next time!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was originally written by Miranda Auhl and published on the &lt;a href="https://blog.timescale.com/blog/speeding-up-data-analysis/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_id=tsdb-for-data-analysis&amp;amp;utm_content=original-blog-post" rel="noopener noreferrer"&gt;Timescale blog&lt;/a&gt; on September 9, 2021.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>postgres</category>
      <category>timeseries</category>
    </item>
    <item>
      <title>Hacking NFL data with PostgreSQL, TimescaleDB, and SQL</title>
      <dc:creator>mirandaauhl</dc:creator>
      <pubDate>Fri, 30 Jul 2021 18:55:02 +0000</pubDate>
      <link>https://dev.to/tigerdata/hacking-nfl-data-with-postgresql-timescaledb-and-sql-5e24</link>
      <guid>https://dev.to/tigerdata/hacking-nfl-data-with-postgresql-timescaledb-and-sql-5e24</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The NFL dataset&lt;/li&gt;
&lt;li&gt;Accessing the data&lt;/li&gt;
&lt;li&gt;Let's start exploring!&lt;/li&gt;
&lt;li&gt;The power of SQL&lt;/li&gt;
&lt;li&gt;Faster insights with PostgreSQL and TimescaleDB&lt;/li&gt;
&lt;li&gt;Faster queries with TimescaleDB continuous aggregates&lt;/li&gt;
&lt;li&gt;Advanced SQL data analysis with TimescaleDB hyperfunctions&lt;/li&gt;
&lt;li&gt;Where can the data take you?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Learn how to use time-series data provided by the NFL to uncover valuable insights into many player performance metrics – and ways to apply the same methods to improve your fantasy league team, your knowledge of the game, or your viewing experience - all with PostgreSQL, standard SQL, and freely available extensions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Time-series data is everywhere, including, much to our surprise, the world of professional sports. At &lt;a href="https://www.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=timescale-site" rel="noopener noreferrer"&gt;Timescale&lt;/a&gt;, we're always looking for fun ways to showcase the expanding reach of time-series data. &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/analyze-intraday-stocks/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=stock-tutorial" rel="noopener noreferrer"&gt;Stock&lt;/a&gt;, &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/analyze-cryptocurrency-data/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=crypto-tutorial" rel="noopener noreferrer"&gt;cryptocurrency&lt;/a&gt;, &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/nyc-taxi-cab/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=iot-tutorial" rel="noopener noreferrer"&gt;IoT&lt;/a&gt;, and &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/promscale/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=promscale-tutorial" rel="noopener noreferrer"&gt;infrastructure metrics&lt;/a&gt; data are relatively common and widely understood time-series data scenarios. Head to Twitter on any given day, search for &lt;a href="https://twitter.com/hashtag/TimeSeries" rel="noopener noreferrer"&gt;#timeseries&lt;/a&gt; or &lt;a href="https://twitter.com/hashtag/TimescaleDB" rel="noopener noreferrer"&gt;#TimescaleDB&lt;/a&gt;, and you're sure to find questions about high-frequency trading or massive scale observability data with tools like Prometheus.&lt;/p&gt;

&lt;p&gt;You can imagine our excitement, then, when we happened upon the &lt;a href="https://operations.nfl.com/gameday/analytics/big-data-bowl/" rel="noopener noreferrer"&gt;NFL Big Data Bowl&lt;/a&gt;, an annual competition that encourages the data science community to use historical player position and play data to create machine learning models.&lt;/p&gt;

&lt;p&gt;Did the NFL &lt;strong&gt;&lt;em&gt;really&lt;/em&gt;&lt;/strong&gt; give access to 18+ million rows of detailed play data from every regular season NFL game?&lt;/p&gt;

&lt;p&gt;For background, the National Football League (NFL) is the US professional sports league for American football, and the NFL season is followed by tens of millions of people, culminating in the annual Super Bowl (which attracts 100M+ global viewers, whether for the game or for the commercials).&lt;/p&gt;

&lt;p&gt;Each NFL game takes place as a series of “plays,” in which the two teams try to score and prevent the other team from scoring. There are approximately 200 plays per game, with up to 15 games a week during the regular season. A healthy amount of data, but nothing unmanageable.&lt;/p&gt;

&lt;p&gt;So, at first glance, football game metrics might not immediately jump out as anything special.&lt;/p&gt;

&lt;p&gt;But then the NFL did something pretty ambitious and amazing.&lt;/p&gt;

&lt;p&gt;All &lt;a href="https://operations.nfl.com/gameday/technology/nfl-next-gen-stats/" rel="noopener noreferrer"&gt;NFL players are equipped with RFID chips&lt;/a&gt; that track players’ position, speed, and various other metrics, which teams use to identify trends, mitigate risks, and continuously optimize. The NFL started tracking and storing data for every player on the field, for every play, for every game.&lt;/p&gt;

&lt;p&gt;As a result, we now have access to a very detailed analysis of exactly how a play unfolded, how quickly various players accelerated during each play, and the play’s outcome. A traditional view of play-by-play metrics is “down and distance” and the result of the play (yards gained, whether or not there was a score, and so on). With the NFL’s dataset, we're able to mine approximately 100 data points at 100-millisecond intervals throughout the play to see speed, distance, involved players, and much more.&lt;/p&gt;

&lt;p&gt;This isn’t ordinary data. &lt;a href="https://blog.timescale.com/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=time-series-blog" rel="noopener noreferrer"&gt;This is time-series data&lt;/a&gt;. Time-series data is a sequence of data points collected over time intervals, giving us the ability to track changes over time. In the case of the NFL’s dataset, we have time-series data that represents how a play changes, including the locations of the players on the field, the location of the ball, the relative acceleration of players in the field of play, and so much more.&lt;/p&gt;

&lt;p&gt;Time-series data comes at you fast, sometimes generating millions of data points per second (&lt;a href="https://blog.timescale.com/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=time-series-blog" rel="noopener noreferrer"&gt;read more about time-series data&lt;/a&gt;). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built &lt;a href="https://blog.timescale.com/blog/timescaledb-2-0-a-multi-node-petabyte-scale-completely-free-relational-database-for-time-series/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=multi-node-blog" rel="noopener noreferrer"&gt;TimescaleDB, a multi-node, petabyte-scale, completely free relational database for time-series&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We couldn't pass up the opportunity to look at the NFL dataset with TimescaleDB, exploring ways we could peer deeper into player performance in hopes of providing insights about overall player performance in the coming season.&lt;/p&gt;

&lt;p&gt;Read on for more information about the &lt;a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview" rel="noopener noreferrer"&gt;NFL’s dataset&lt;/a&gt; and how you can start using it, plus some sample queries to jumpstart your analysis. They may help you get more enjoyment out of the game.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you’d like to get started with NFL data, you can spin up a fully managed TimescaleDB service:&lt;/strong&gt; create an account to try it for free for 30 days. The instructions later in this post will take you through how to ingest the data and start using it for analysis.&lt;/p&gt;

&lt;p&gt;If you’re new to time-series data or just have some questions you’d like to ask about the dataset, &lt;a href="https://slack.timescale.com/" rel="noopener noreferrer"&gt;join our public Slack community&lt;/a&gt;, where you’ll find Timescale team members and thousands of time-series enthusiasts, and we’ll be happy to help you.&lt;/p&gt;




&lt;h2&gt;
  
  
  The NFL dataset
&lt;/h2&gt;

&lt;p&gt;Over the last few years, the NFL and Kaggle have collaborated on the &lt;a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview" rel="noopener noreferrer"&gt;NFL Big Data Bowl&lt;/a&gt;. The goal is to use historical data to answer a predetermined genre of questions, typically producing a machine learning model that can help predict the outcome of certain plays during regular season games.&lt;/p&gt;

&lt;p&gt;Although the 2020/2021 contest is over, the sample dataset they provided from a prior season is still available for download and analysis. The 2020/2021 competition focused on pass play defense efficiency; therefore, only the tracking data for offensive and defensive "playmakers" is available in the dataset. No offensive or defensive linemen data is included. (You can read more about &lt;a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/discussion/217170" rel="noopener noreferrer"&gt;last year’s winners&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;(Keep watching the &lt;a href="https://operations.nfl.com/gameday/analytics/big-data-bowl/" rel="noopener noreferrer"&gt;NFL website&lt;/a&gt; for more information on the next Big Data Bowl.)&lt;/p&gt;




&lt;h2&gt;
  
  
  Accessing the data
&lt;/h2&gt;

&lt;p&gt;For the purposes of this blog post and accompanying tutorial, we will use the &lt;a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview" rel="noopener noreferrer"&gt;sample data provided by the NFL&lt;/a&gt;. This data is from the 2018 NFL season and is available as CSV files, including game-specific data and week-by-week tracking data for each player involved in the "offensive" part of the pass play. Participants in the next season of the contest will have access to new weekly game data.&lt;/p&gt;

&lt;p&gt;This data is also very relational in nature, which means that SQL is a great medium to start gleaning value – without the need for Jupyter notebooks, data-science-specific languages (like Python or R), or additional toolsets.&lt;/p&gt;

&lt;p&gt;If you want to follow along with - or recreate! - the queries we go through below, &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=nfl-tutorial" rel="noopener noreferrer"&gt;follow our tutorial&lt;/a&gt; to set up the tables, ingest data, and start analyzing data in TimescaleDB. For those unfamiliar with TimescaleDB, it’s built on PostgreSQL, so you’ll find that all of our queries are standard SQL. If you know SQL, you’ll know how to do everything here. (Some of the more advanced query examples we provide require our new, advanced hyperfunctions, which come pre-installed with any &lt;a href="https://console.forge.timescale.com/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=forge" rel="noopener noreferrer"&gt;Timescale Forge instance&lt;/a&gt;.)&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's start exploring!
&lt;/h2&gt;

&lt;p&gt;We've provided the steps needed to ingest the dataset into TimescaleDB in the &lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=nfl-tutorial" rel="noopener noreferrer"&gt;accompanying tutorial&lt;/a&gt;, so we won’t go into that here.&lt;/p&gt;

&lt;p&gt;The NFL dataset includes the following data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Games:&lt;/strong&gt; all relevant data about each game of the regular season, including date, teams, time, and location&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Players:&lt;/strong&gt; information on each player, including what team they play for and their originating college&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plays:&lt;/strong&gt; a wealth of data about each pass play in the game. Helpful fields include the down, a description of the play, the line of scrimmage, and total offensive yardage, among other details.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Week [1-17]:&lt;/strong&gt; for each week of the season, the NFL provides a new CSV file with the tracking data of every player, for every play (pass plays for this data). Interesting fields include X/Y position data (relative to the football field) every few hundred milliseconds throughout each play, player acceleration, and the "type" of route taken. (In our tutorial, this data is imported into the &lt;code&gt;tracking&lt;/code&gt; table and totals almost 20 million rows of time-series data.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
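
&lt;p&gt;Once ingest is complete, a quick sanity check is to verify the row counts of each table. (This is a minimal sketch; the table names below follow the tutorial's schema, with the weekly files loaded into &lt;code&gt;tracking&lt;/code&gt;.)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Row counts per table; tracking should hold almost 20 million rows
SELECT 'game' AS table_name, count(*) AS row_count FROM game
UNION ALL SELECT 'player', count(*) FROM player
UNION ALL SELECT 'play', count(*) FROM play
UNION ALL SELECT 'tracking', count(*) FROM tracking;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;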

&lt;p&gt;In addition to the NFL dataset, we also provide some extra data from Wikipedia that includes game scores and stadium conditions for each game, which you can load as part of the tutorial. With other time-series databases, it can be difficult to combine your time-series data with any other data you may have on hand (see our &lt;a href="https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=influx-compare-blog" rel="noopener noreferrer"&gt;TimescaleDB vs. InfluxDB comparison&lt;/a&gt; for reference).&lt;/p&gt;

&lt;p&gt;Because TimescaleDB is PostgreSQL with time-series superpowers, it supports JOINs, so any extra relational data you want to add for deeper analysis is just a SQL query away. In our case, we can combine the NFL’s play-by-play data with weather data for each stadium.&lt;/p&gt;
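
&lt;p&gt;As a sketch of what such a join could look like (the &lt;code&gt;weather&lt;/code&gt; table and its columns here are hypothetical; adapt the names to however you load the extra stadium data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Join play-by-play data to per-game weather data;
-- the weather table name and its columns are assumptions, not part of the dataset
SELECT p.gameid, p.playid, w.temperature, w.conditions
FROM play p
INNER JOIN game g ON p.gameid = g.game_id
INNER JOIN weather w ON g.game_id = w.game_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;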

&lt;p&gt;Once you have the data ready, the world of NFL playmakers is at your fingertips, so let’s get started!&lt;/p&gt;




&lt;h2&gt;
  
  
  The power of SQL
&lt;/h2&gt;

&lt;p&gt;Year after year, we see SQL listed as one of the most popular languages among developers in the &lt;a href="https://insights.stackoverflow.com/survey/2020#technology-programming-scripting-and-markup-languages-all-respondents" rel="noopener noreferrer"&gt;Stack Overflow survey&lt;/a&gt;. Sometimes, however, we can be lured into thinking that the only way to gain insights from relational data is to query it with powerful data analytics tools and languages, create data frames, and use specialized regression algorithms before we can do anything productive.&lt;/p&gt;

&lt;p&gt;It often feels like SQL is only useful for getting and storing data in applications, and that we need to leave the "heavy lifting" of analysis to more mature tools.&lt;/p&gt;

&lt;p&gt;Not so! SQL can data munge with the best of them! Let's look at a first, quick example.&lt;/p&gt;

&lt;h3&gt;
  
  
  Average yards per position, per game
&lt;/h3&gt;

&lt;p&gt;For this first example, we'll query the &lt;code&gt;tracking&lt;/code&gt; table (the player movement data from all 17 weeks of games) and join to the &lt;code&gt;game&lt;/code&gt; table to determine the number of yards per player position, per game.&lt;/p&gt;

&lt;p&gt;The results give you a quick overview of how many yards players at different positions ran throughout each game. You could later use this baseline to see how specific players compare - more or fewer yards - against the totals for their position.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;total_position_yards&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;position_yards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tracking&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;position_yards&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;game_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;total_position_yards&lt;/span&gt; &lt;span class="n"&gt;tpy&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;game_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tpy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'QB'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'RB'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'WR'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'TE'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;game_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Number of plays by offensive player
&lt;/h3&gt;

&lt;p&gt;As a season progresses and players get injured (or traded), it's helpful to know which of the available players have more playing experience, rather than those who have been sitting on the sideline for most of the season. Players with more playing time are often better able to contribute to the outcome of the game.&lt;/p&gt;

&lt;p&gt;This query finds all players that were on the offense for any play and counts how many total passing plays they have been a part of, ordered by total passing plays descending.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;snap_events&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;-- Create a table that filters the play events to show only snap plays&lt;/span&gt;
&lt;span class="c1"&gt;-- and display the players team information&lt;/span&gt;
 &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;CASE&lt;/span&gt;
     &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'away'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visitor_team&lt;/span&gt;
     &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'home'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;home_team&lt;/span&gt;
     &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
     &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;team_name&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tracking&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
 &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;game_id&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'snap_direct'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'ball_snap'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- Count these events &amp;amp; filter results to only display data when the player was&lt;/span&gt;
&lt;span class="c1"&gt;-- on the offensive&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;play_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;snap_events&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;play&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;player&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;possessionteam&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team_name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;play_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;player_id&lt;/th&gt;
&lt;th&gt;display_name&lt;/th&gt;
&lt;th&gt;play_count&lt;/th&gt;
&lt;th&gt;team_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2506109&lt;/td&gt;
&lt;td&gt;Ben Roethlisberger&lt;/td&gt;
&lt;td&gt;725&lt;/td&gt;
&lt;td&gt;PIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2558149&lt;/td&gt;
&lt;td&gt;JuJu Smith-Schuster&lt;/td&gt;
&lt;td&gt;691&lt;/td&gt;
&lt;td&gt;PIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2533031&lt;/td&gt;
&lt;td&gt;Andrew Luck&lt;/td&gt;
&lt;td&gt;683&lt;/td&gt;
&lt;td&gt;IND&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2508061&lt;/td&gt;
&lt;td&gt;Antonio Brown&lt;/td&gt;
&lt;td&gt;679&lt;/td&gt;
&lt;td&gt;PIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;310&lt;/td&gt;
&lt;td&gt;Matt Ryan&lt;/td&gt;
&lt;td&gt;659&lt;/td&gt;
&lt;td&gt;ATL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2506363&lt;/td&gt;
&lt;td&gt;Aaron Rodgers&lt;/td&gt;
&lt;td&gt;656&lt;/td&gt;
&lt;td&gt;GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2505996&lt;/td&gt;
&lt;td&gt;Eli Manning&lt;/td&gt;
&lt;td&gt;639&lt;/td&gt;
&lt;td&gt;NYG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2543495&lt;/td&gt;
&lt;td&gt;Davante Adams&lt;/td&gt;
&lt;td&gt;630&lt;/td&gt;
&lt;td&gt;GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2540158&lt;/td&gt;
&lt;td&gt;Zach Ertz&lt;/td&gt;
&lt;td&gt;629&lt;/td&gt;
&lt;td&gt;PHI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2532820&lt;/td&gt;
&lt;td&gt;Kirk Cousins&lt;/td&gt;
&lt;td&gt;621&lt;/td&gt;
&lt;td&gt;MIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;79860&lt;/td&gt;
&lt;td&gt;Matthew Stafford&lt;/td&gt;
&lt;td&gt;619&lt;/td&gt;
&lt;td&gt;DET&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2504211&lt;/td&gt;
&lt;td&gt;Tom Brady&lt;/td&gt;
&lt;td&gt;613&lt;/td&gt;
&lt;td&gt;NE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you’re familiar with American football, you might know that players are substituted in and out of the game based on game conditions. Stronger, larger players may play in some situations, while faster, more agile players may play in others.&lt;/p&gt;

&lt;p&gt;Quarterbacks are the most “important” players on the field and tend to play more than others. By omitting quarterbacks, however, we can get deeper insight into players across all other positions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;snap_events&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;-- Create a table that filters the play events to show only snap plays&lt;/span&gt;
&lt;span class="c1"&gt;-- and display the players team information&lt;/span&gt;
 &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;CASE&lt;/span&gt;
     &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'away'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visitor_team&lt;/span&gt;
     &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'home'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;home_team&lt;/span&gt;
     &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
     &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;team_name&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tracking&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
 &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;game_id&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'snap_direct'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'ball_snap'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- Count these events &amp;amp; filter results to only display data when the player was&lt;/span&gt;
&lt;span class="c1"&gt;-- on the offensive&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;play_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"position"&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;snap_events&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;play&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;player&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;possessionteam&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"position"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'QB'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;team_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"position"&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;play_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, now we can see the non-quarterbacks who are on offense the most in a season:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;player_id&lt;/th&gt;
&lt;th&gt;display_name&lt;/th&gt;
&lt;th&gt;play_count&lt;/th&gt;
&lt;th&gt;team_name&lt;/th&gt;
&lt;th&gt;position&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2558149&lt;/td&gt;
&lt;td&gt;JuJu Smith-Schuster&lt;/td&gt;
&lt;td&gt;691&lt;/td&gt;
&lt;td&gt;PIT&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2508061&lt;/td&gt;
&lt;td&gt;Antonio Brown&lt;/td&gt;
&lt;td&gt;679&lt;/td&gt;
&lt;td&gt;PIT&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2543495&lt;/td&gt;
&lt;td&gt;Davante Adams&lt;/td&gt;
&lt;td&gt;630&lt;/td&gt;
&lt;td&gt;GB&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2540158&lt;/td&gt;
&lt;td&gt;Zach Ertz&lt;/td&gt;
&lt;td&gt;629&lt;/td&gt;
&lt;td&gt;PHI&lt;/td&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2541785&lt;/td&gt;
&lt;td&gt;Adam Thielen&lt;/td&gt;
&lt;td&gt;612&lt;/td&gt;
&lt;td&gt;MIN&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2543468&lt;/td&gt;
&lt;td&gt;Mike Evans&lt;/td&gt;
&lt;td&gt;610&lt;/td&gt;
&lt;td&gt;TB&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2555295&lt;/td&gt;
&lt;td&gt;Sterling Shepard&lt;/td&gt;
&lt;td&gt;610&lt;/td&gt;
&lt;td&gt;NYG&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2540169&lt;/td&gt;
&lt;td&gt;Robert Woods&lt;/td&gt;
&lt;td&gt;604&lt;/td&gt;
&lt;td&gt;LA&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2552600&lt;/td&gt;
&lt;td&gt;Nelson Agholor&lt;/td&gt;
&lt;td&gt;604&lt;/td&gt;
&lt;td&gt;PHI&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2543488&lt;/td&gt;
&lt;td&gt;Jarvis Landry&lt;/td&gt;
&lt;td&gt;592&lt;/td&gt;
&lt;td&gt;CLE&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2540165&lt;/td&gt;
&lt;td&gt;DeAndre Hopkins&lt;/td&gt;
&lt;td&gt;587&lt;/td&gt;
&lt;td&gt;HOU&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2543498&lt;/td&gt;
&lt;td&gt;Brandin Cooks&lt;/td&gt;
&lt;td&gt;581&lt;/td&gt;
&lt;td&gt;LA&lt;/td&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Sack percentage by quarterback on passing plays
&lt;/h3&gt;

&lt;p&gt;We can start to go a little deeper by extracting specific data from the &lt;code&gt;tracking&lt;/code&gt; table and layering queries on top of it to make correlations. One piece of information that might be helpful in your analysis is knowing which quarterbacks are sacked most often during passing plays. In football, a “sack” is a negative play for the offense, and quarterbacks who get sacked more often tend to be lower performers overall.&lt;/p&gt;

&lt;p&gt;Once you know those players, you could expand your analysis to see if they are sacked more on specific types of plays (e.g., plays from the shotgun formation), or whether sacks occur more often in a specific quarter of the game (perhaps the fourth quarter, because the offensive line is more tired, or because the team tends to be behind late in games and must pass more often).&lt;/p&gt;

&lt;p&gt;Queries like this can quickly show you quarterbacks who are more likely to get sacked, particularly when they play a strong defensive team. To get started, we wanted to find the sack percentage of each quarterback based on the total number of pass plays they were involved in during the regular season. To do that, we layered Common Table Expressions (CTEs) on the tracking data so that each query could build upon previous results.&lt;/p&gt;

&lt;p&gt;First, we select the distinct list of all plays for each quarterback (&lt;code&gt;qb_plays&lt;/code&gt;). We use &lt;code&gt;SELECT DISTINCT&lt;/code&gt; because the tracking table holds multiple entries for each player on each play; we need just one row per play, for each quarterback.&lt;/p&gt;

&lt;p&gt;With this result, we can then count the number of total plays per quarterback (&lt;code&gt;total_qb_plays&lt;/code&gt;), the total number of games each quarterback played (&lt;code&gt;qb_games&lt;/code&gt;), and finally the number of pass plays the quarterback was a part of that resulted in a sack (&lt;code&gt;sacks&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;With that data in hand, we can finally query all of the values, do a percentage calculation, and order it by the total sack count.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;qb_plays&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;playid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;playid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tracking&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'QB'&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;total_qb_plays&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;play_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;qb_plays&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;qb_games&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;game_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;qb_plays&lt;/span&gt; 
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;sacks&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sack_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;play&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
    &lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;qb_plays&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qb_plays&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qb_plays&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playid&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passresult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'S'&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;play_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;game_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sack_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sack_count&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;play_count&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;sack_percentage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;total_qb_plays&lt;/span&gt; &lt;span class="n"&gt;tqp&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;qb_games&lt;/span&gt; &lt;span class="n"&gt;qg&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;tqp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sacks&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;player&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;tqp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sack_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="n"&gt;NULLS&lt;/span&gt; &lt;span class="k"&gt;last&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're an ardent football fan, the results from 2018 probably don't surprise you.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;play_count&lt;/th&gt;
&lt;th&gt;game_count&lt;/th&gt;
&lt;th&gt;sack_count&lt;/th&gt;
&lt;th&gt;sack_percentage&lt;/th&gt;
&lt;th&gt;display_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;579&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;11.23&lt;/td&gt;
&lt;td&gt;Deshaun Watson&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;602&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;55&lt;/td&gt;
&lt;td&gt;9.14&lt;/td&gt;
&lt;td&gt;Dak Prescott&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;611&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;8.67&lt;/td&gt;
&lt;td&gt;Derek Carr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;656&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;7.47&lt;/td&gt;
&lt;td&gt;Aaron Rodgers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;462&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;10.39&lt;/td&gt;
&lt;td&gt;Russell Wilson&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;639&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;7.36&lt;/td&gt;
&lt;td&gt;Eli Manning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;10.04&lt;/td&gt;
&lt;td&gt;Josh Rosen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;659&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;6.53&lt;/td&gt;
&lt;td&gt;Matt Ryan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;386&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;11.14&lt;/td&gt;
&lt;td&gt;Marcus Mariota&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;619&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;td&gt;6.62&lt;/td&gt;
&lt;td&gt;Matthew Stafford&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;621&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;6.12&lt;/td&gt;
&lt;td&gt;Kirk Cousins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;324&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;11.42&lt;/td&gt;
&lt;td&gt;Ryan Tannehill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;447&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;8.05&lt;/td&gt;
&lt;td&gt;Carson Wentz&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Of course, there are a few quarterbacks who always seem to find a way to avoid a sack.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;play_count&lt;/th&gt;
&lt;th&gt;game_count&lt;/th&gt;
&lt;th&gt;sack_count&lt;/th&gt;
&lt;th&gt;sack_percentage&lt;/th&gt;
&lt;th&gt;display_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;725&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;3.45&lt;/td&gt;
&lt;td&gt;Ben Roethlisberger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;682&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;3.23&lt;/td&gt;
&lt;td&gt;Andrew Luck&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;613&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;3.43&lt;/td&gt;
&lt;td&gt;Tom Brady&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
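&lt;p&gt;The &lt;code&gt;sack_percentage&lt;/code&gt; column in both tables is just the ratio computed in the final &lt;code&gt;SELECT&lt;/code&gt;: &lt;code&gt;sack_count / play_count * 100&lt;/code&gt;. As a quick sanity check, a few lines of Python reproduce the values from the rows above:&lt;/p&gt;

```python
# Reproduce the sack_percentage column from the tables above:
# sack_percentage = sack_count / play_count * 100
rows = [
    ("Deshaun Watson", 579, 65),
    ("Russell Wilson", 462, 48),
    ("Tom Brady", 613, 21),
]

for name, play_count, sack_count in rows:
    sack_percentage = sack_count / play_count * 100
    print(f"{name}: {sack_percentage:.2f}%")
    # Deshaun Watson: 11.23%, Russell Wilson: 10.39%, Tom Brady: 3.43%
```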

&lt;p&gt;Now, let’s try some more “advanced” queries and analyses.&lt;/p&gt;




&lt;h2&gt;
  
  
  Faster insights with PostgreSQL and TimescaleDB
&lt;/h2&gt;

&lt;p&gt;So far, the queries we've shown are interesting and provide insights into various players throughout the season – but if you were looking closely, they're all regular SQL statements.&lt;/p&gt;

&lt;p&gt;A season of NFL tracking data isn't like typical time-series data, however: most of the queries we want to perform need to examine all 20 million rows in some way.&lt;/p&gt;

&lt;p&gt;This is where a tool that's been built for time-series analysis, even when the data isn't typical time-series data, can significantly improve your ability to examine the data and save money at the same time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Faster queries with TimescaleDB continuous aggregates
&lt;/h2&gt;

&lt;p&gt;We noticed that we often needed to build queries that started with the &lt;code&gt;tracking&lt;/code&gt; table, filtering data by specific players, positions, and games. Part of the reason is that the &lt;code&gt;play&lt;/code&gt; table doesn't list all of the players who were involved in a particular play. As a result, we need to cross-reference the &lt;code&gt;tracking&lt;/code&gt; table to identify the players who were involved in any given play.&lt;/p&gt;

&lt;p&gt;The first query we demonstrated - “average yards per position, per game” - is a good example of this. It begins by summing all yards, by position, for each game.&lt;/p&gt;

&lt;p&gt;This means that every row in &lt;code&gt;tracking&lt;/code&gt; has to be read and aggregated before we can do any other analysis. Scanning those 20 million rows is pretty boring, repetitive, and slow work – especially compared to the analysis we want to do!&lt;/p&gt;

&lt;p&gt;On our small test instance, the "average yards" query takes about 8 seconds to run. We could increase the size of the instance (which will cost us more money), or we could be smarter about how we query the data (which will cost us more time).&lt;/p&gt;

&lt;p&gt;Instead, we can use continuous aggregates to pre-aggregate the data we're querying over and over again, which reduces the amount of work TimescaleDB needs to do every time we run the query. (Continuous aggregates are like PostgreSQL materialized views. For more info, check out our &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=caggs" rel="noopener noreferrer"&gt;continuous aggregates docs&lt;/a&gt;.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;player_yards_by_game_&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timescaledb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;yards&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tracking&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this query and creating a continuous aggregate, we can modify that first query just slightly, using the continuous aggregate as our base table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;total_position_yards&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;position_yards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;player_yards_by_game&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;position_yards&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;game_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;game&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;total_position_yards&lt;/span&gt; &lt;span class="n"&gt;tpy&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;game_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tpy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gameid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'QB'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'RB'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'WR'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'TE'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;game_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;game_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We get the same result, but now the query runs in about 100 ms - &lt;strong&gt;80x faster!&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Advanced SQL data analysis with TimescaleDB hyperfunctions
&lt;/h2&gt;

&lt;p&gt;Finally, the more we dug into the data, the more we found we needed (or wanted) functions specifically tuned for time-series data analysis to answer the types of questions we wanted to ask.&lt;/p&gt;

&lt;p&gt;It is for this kind of analysis that we built &lt;a href="https://blog.timescale.com/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=hyperfunctions-blog" rel="noopener noreferrer"&gt;TimescaleDB hyperfunctions&lt;/a&gt;, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grouping data into percentiles
&lt;/h3&gt;

&lt;p&gt;The NFL dataset is a great use case for &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=hyperfunctions-percentile" rel="noopener noreferrer"&gt;percentiles&lt;/a&gt;. Being able to quickly find players who perform better or worse than their cohort is really powerful.&lt;/p&gt;

&lt;p&gt;As an example, we'll use the same continuous aggregate we created earlier (total yards, per game, per player) to find the median total yards traveled by position for each game.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;sum_yards&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;--Add position to the table to allow for grouping by it later&lt;/span&gt;
 &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;player_yards_by_game&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
 &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;player&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gameid&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;--Find the mean and median for each position type&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mean_yards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;median_yards&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sum_yards&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;mean_yards&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;position&lt;/th&gt;
&lt;th&gt;mean_yards&lt;/th&gt;
&lt;th&gt;median_yards&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FS&lt;/td&gt;
&lt;td&gt;595.583433048431&lt;/td&gt;
&lt;td&gt;626.388099960848&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CB&lt;/td&gt;
&lt;td&gt;572.3336749867212&lt;/td&gt;
&lt;td&gt;592.2175990890378&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;552.6508570179277&lt;/td&gt;
&lt;td&gt;555.5030569048633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S&lt;/td&gt;
&lt;td&gt;530.6436781609186&lt;/td&gt;
&lt;td&gt;550.5961518474892&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SS&lt;/td&gt;
&lt;td&gt;522.5604103343453&lt;/td&gt;
&lt;td&gt;551.1296628916651&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLB&lt;/td&gt;
&lt;td&gt;462.70229007633407&lt;/td&gt;
&lt;td&gt;490.77906906009343&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ILB&lt;/td&gt;
&lt;td&gt;402.7882871125599&lt;/td&gt;
&lt;td&gt;403.3779668359464&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OLB&lt;/td&gt;
&lt;td&gt;393.40014271151847&lt;/td&gt;
&lt;td&gt;390.6742117791442&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QB&lt;/td&gt;
&lt;td&gt;334.7025466893028&lt;/td&gt;
&lt;td&gt;352.1192705472368&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LB&lt;/td&gt;
&lt;td&gt;328.9812527472519&lt;/td&gt;
&lt;td&gt;257.72003396053884&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;327.9515596330271&lt;/td&gt;
&lt;td&gt;257.72003396053884&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
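&lt;p&gt;Outside the database, the same mean-versus-median comparison can be sketched with Python's &lt;code&gt;statistics&lt;/code&gt; module. This toy version uses made-up yardage numbers and exact calculations, where the hyperfunctions compute an approximation over the full dataset:&lt;/p&gt;

```python
import statistics

# Hypothetical per-game yardage totals for two positions (made-up numbers,
# standing in for the continuous aggregate's SUM(dis) AS yards).
yards_by_position = {
    "WR": [540.2, 561.7, 555.5, 590.1, 515.3],
    "QB": [330.4, 352.1, 310.9, 360.0, 340.2],
}

for position, yards in yards_by_position.items():
    mean_yards = statistics.mean(yards)      # analogous to mean(percentile_agg(yards))
    median_yards = statistics.median(yards)  # analogous to approx_percentile(0.5, percentile_agg(yards))
    print(f"{position}: mean={mean_yards:.1f}, median={median_yards:.1f}")
```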

&lt;h3&gt;
  
  
  Finding extreme outliers
&lt;/h3&gt;

&lt;p&gt;Finally, we can build upon this percentile query to find players at each position who run farther than 95% of all other players at that position. For some positions, like wide receiver or free safety, this could help us find the “outlier” players who are able to travel the field consistently throughout a game – and make plays!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;sum_yards&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="c1"&gt;--Add position to the table to allow for grouping by it later&lt;/span&gt;
 &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;player_yards_by_game&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
 &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;player&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;position_percentile&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p95&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sum_yards&lt;/span&gt; 
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;POSITION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p95&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sum_yards&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;position_percentile&lt;/span&gt; &lt;span class="n"&gt;pp&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;POSITION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;yards&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;p95&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;POSITION&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'WR'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'FS'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'QB'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'TE'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;position&lt;/th&gt;
&lt;th&gt;display_name&lt;/th&gt;
&lt;th&gt;yards&lt;/th&gt;
&lt;th&gt;p95&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FS&lt;/td&gt;
&lt;td&gt;Eric Weddle&lt;/td&gt;
&lt;td&gt;13869.76&lt;/td&gt;
&lt;td&gt;12320.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FS&lt;/td&gt;
&lt;td&gt;Adrian Amos&lt;/td&gt;
&lt;td&gt;12989.44&lt;/td&gt;
&lt;td&gt;12320.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FS&lt;/td&gt;
&lt;td&gt;Tyrann Mathieu&lt;/td&gt;
&lt;td&gt;12565.22&lt;/td&gt;
&lt;td&gt;12320.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QB&lt;/td&gt;
&lt;td&gt;Aaron Rodgers&lt;/td&gt;
&lt;td&gt;7422.36&lt;/td&gt;
&lt;td&gt;6667.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QB&lt;/td&gt;
&lt;td&gt;Patrick Mahomes&lt;/td&gt;
&lt;td&gt;6985.99&lt;/td&gt;
&lt;td&gt;6667.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QB&lt;/td&gt;
&lt;td&gt;Matt Ryan&lt;/td&gt;
&lt;td&gt;6759.96&lt;/td&gt;
&lt;td&gt;6667.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;Zach Ertz&lt;/td&gt;
&lt;td&gt;13124.59&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;Jimmy Graham&lt;/td&gt;
&lt;td&gt;12693.68&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;Travis Kelce&lt;/td&gt;
&lt;td&gt;12218.13&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;David Njoku&lt;/td&gt;
&lt;td&gt;11502.16&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;George Kittle&lt;/td&gt;
&lt;td&gt;11058.10&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;Kyle Rudolph&lt;/td&gt;
&lt;td&gt;10761.95&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TE&lt;/td&gt;
&lt;td&gt;Jared Cook&lt;/td&gt;
&lt;td&gt;10678.23&lt;/td&gt;
&lt;td&gt;10667.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Antonio Brown&lt;/td&gt;
&lt;td&gt;16877.56&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Brandin Cooks&lt;/td&gt;
&lt;td&gt;15510.02&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;JuJu Smith-Schuster&lt;/td&gt;
&lt;td&gt;15492.77&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Robert Woods&lt;/td&gt;
&lt;td&gt;15253.18&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Nelson Agholor&lt;/td&gt;
&lt;td&gt;15180.33&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Tyreek Hill&lt;/td&gt;
&lt;td&gt;15106.61&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Zay Jones&lt;/td&gt;
&lt;td&gt;14790.59&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Sterling Shepard&lt;/td&gt;
&lt;td&gt;14673.80&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Mike Evans&lt;/td&gt;
&lt;td&gt;14620.13&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Davante Adams&lt;/td&gt;
&lt;td&gt;14574.51&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Kenny Golladay&lt;/td&gt;
&lt;td&gt;14354.50&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WR&lt;/td&gt;
&lt;td&gt;Jarvis Landry&lt;/td&gt;
&lt;td&gt;14281.51&lt;/td&gt;
&lt;td&gt;14271.23&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
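The table lists, for each position, the players whose total yards exceed that position's 95th percentile (the repeated p95 column is the per-position threshold). The original analysis computes this in SQL; as a rough illustration of the same logic, here is a minimal Python sketch. The player names and yardage figures in it are placeholders, not NFL data, and `statistics.quantiles` is only an approximation of what PostgreSQL's `percentile_cont(0.95)` would return:

```python
import statistics
from collections import defaultdict

# Illustrative stand-in for the per-player totals table:
# (position, display_name, yards). Made-up values, not NFL data.
rows = [
    ("QB", "Player A", 100.0),
    ("QB", "Player B", 200.0),
    ("QB", "Player C", 300.0),
    ("QB", "Player D", 400.0),
    ("QB", "Player E", 500.0),
    ("TE", "Player F", 50.0),
    ("TE", "Player G", 60.0),
    ("TE", "Player H", 70.0),
    ("TE", "Player I", 80.0),
    ("TE", "Player J", 90.0),
]

# Group yardage totals by position.
by_pos = defaultdict(list)
for pos, _, yards in rows:
    by_pos[pos].append(yards)

# 95th percentile per position: with n=20, the last cut point is p95.
# method="inclusive" interpolates linearly over the sample, roughly
# what percentile_cont(0.95) does in SQL.
p95 = {
    pos: statistics.quantiles(vals, n=20, method="inclusive")[-1]
    for pos, vals in by_pos.items()
}

# Keep only players above their position's threshold, as in the table.
leaders = [
    (pos, name, yards, p95[pos])
    for pos, name, yards in rows
    if yards > p95[pos]
]
for row in leaders:
    print(row)
```

In SQL this corresponds to something like `percentile_cont(0.95) WITHIN GROUP (ORDER BY yards)` grouped by position (or one of TimescaleDB's percentile aggregates), joined back against the per-player totals; the Python version is only meant to make the table's logic concrete.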




&lt;h2&gt;
  
  
  Where can the data take you?
&lt;/h2&gt;

&lt;p&gt;As you’ve seen in this example, &lt;strong&gt;time-series data is everywhere&lt;/strong&gt;. Being able to harness it gives you a huge advantage, whether you’re working on a professional solution or a personal project.&lt;/p&gt;

&lt;p&gt;We’ve shown you a few ways that time-series queries can unlock interesting insights, give you a greater appreciation for the game and its players, and (hopefully) inspire you to dig into the data yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To get started with the &lt;a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview" rel="noopener noreferrer"&gt;NFL data&lt;/a&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spin up a fully managed TimescaleDB service:&lt;/strong&gt; create an account to &lt;a href="https://console.forge.timescale.com/signup/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=forge-signup" rel="noopener noreferrer"&gt;try it for free&lt;/a&gt; for 30 days.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=nfl-tutorial" rel="noopener noreferrer"&gt;Follow our complete tutorial&lt;/a&gt; for step-by-step instructions for preparing and ingesting the dataset, along with several more queries to help you glean insights from the dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re new to time-series data or just have some questions about how to use TimescaleDB to analyze the NFL’s dataset, &lt;a href="https://slack.timescale.com/" rel="noopener noreferrer"&gt;join our public Slack community&lt;/a&gt;. You’ll find Timescale engineers and thousands of time-series enthusiasts from around the world, and we’ll be happy to help you.&lt;/p&gt;

&lt;p&gt;🙏 We’d like to thank the NFL for making this data available, and the millions of passionate fans around the world who make the NFL such an exciting game to watch.&lt;/p&gt;

&lt;p&gt;And, Geaux Saints 🏈!&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://blog.timescale.com/blog/hacking-nfl-data-with-postgresql-timescaledb-and-sql/?utm_source=dev-to&amp;amp;utm_medium=blog-post&amp;amp;utm_content=nfl-blog-post" rel="noopener noreferrer"&gt;original blog post&lt;/a&gt; was a collaboration between Attila Toth, Miranda Auhl, and Ryan Booz.&lt;/p&gt;

</description>
      <category>database</category>
      <category>analytics</category>
      <category>postgres</category>
      <category>timeseries</category>
    </item>
  </channel>
</rss>
