Kovid Rathee

SQL extensions for time series data in QuestDB part II

This tutorial follows up on an earlier one that introduced SQL extensions in QuestDB for easier time-series analysis. In this tutorial, you'll take a detailed look at the SAMPLE BY extension, whose simplicity and flexibility let you work with time-series data efficiently.

To get started with this tutorial, you should know that SAMPLE BY is a SQL extension in QuestDB that helps you group or bucket your time-series data based on the designated timestamp. It removes the need for lengthy CASE WHEN statements and GROUP BY clauses. Not only that, the SAMPLE BY extension helps you quickly deal with many other data-related issues, such as missing data, incorrect timezones, and offsets.

This tutorial assumes you have an up-and-running QuestDB instance ready for use. Let's dive straight into it.

Setup

Import sample data

Similar to the previous tutorial, we'll use the NYC taxi riders data for February 2018. You can use the following script utilizing the HTTP REST API to upload data into QuestDB:

curl https://s3-eu-west-1.amazonaws.com/questdb.io/datasets/grafana_tutorial_dataset.tar.gz > grafana_data.tar.gz
tar -xvf grafana_data.tar.gz

curl -F data=@taxi_trips_feb_2018.csv http://localhost:9000/imp
curl -F data=@weather.csv http://localhost:9000/imp

Alternatively, you can utilize the import functionality in the QuestDB console, as shown in the image below:

For importing large CSV files into partitioned tables, QuestDB recommends using the COPY command. This method is especially useful when you are migrating data from another database into QuestDB.
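As a sketch of what a COPY-based import might look like (the exact options depend on your QuestDB version, and the file must live under the server's configured import root, cairo.sql.copy.root):

```sql
-- Bulk-import the CSV; WITH HEADER true tells QuestDB that the
-- first row of the file contains column names.
COPY taxi_trips FROM 'taxi_trips_feb_2018.csv' WITH HEADER true;
```

Unlike the REST API, COPY runs server-side, which is why it scales better for very large files.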

Create an ordered timestamp column

QuestDB mandates the use of an ordered timestamp column, so you'll have to cast the pickup_datetime column to TIMESTAMP in a new table called taxi_trips with the script below:

CREATE TABLE taxi_trips AS (
  SELECT *
    FROM 'taxi_trips_feb_2018.csv'
   ORDER BY pickup_datetime
) TIMESTAMP(pickup_datetime)
PARTITION BY MONTH;

By converting the pickup_datetime column to timestamp, you'll allow QuestDB to use it as the designated timestamp. Using the designated timestamp column, QuestDB is able to index the table to run time-based queries more efficiently.
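If you want to confirm that pickup_datetime was registered as the designated timestamp, one way (a sketch, assuming the table was created as above) is to inspect the table's column metadata:

```sql
-- The output includes a flag indicating which column, if any,
-- is the table's designated timestamp.
SHOW COLUMNS FROM taxi_trips;
```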

If it all goes well, you should see the following data after running a SELECT * query on the taxi_trips table:

Understanding the basics of SAMPLE BY

The SAMPLE BY extension allows you to create groups and buckets of data based on time ranges. This is especially valuable for time-series data, as you can calculate frequently used aggregates with extreme simplicity. SAMPLE BY lets you summarize or aggregate data from very fine to very coarse units of time: microsecond, millisecond, second, minute, hour, day, and month. You can derive other units of time, such as a week, a fortnight, or a year, from the ones provided out of the box.

Let's look at some examples to understand how to use SAMPLE BY in different scenarios.

Hourly count of trips

You can use the SAMPLE BY keyword with the sample unit h to get an hour-by-hour count of trips for the whole duration of the data set. Running the following query, you'll see the results in the console:

SELECT pickup_datetime,
       COUNT() total_trips
  FROM 'taxi_trips'
 SAMPLE BY 1h;

There are two ways to read your data in the QuestDB console: using the grid, which has a tabular form, or using a chart, where you can draw a line chart, a bar graph, or an area chart to visualize your data. Here's an example of a bar chart drawn from the query mentioned above:

Three-hourly holistic summary of trips

The SAMPLE BY extension allows you to group data by any arbitrary number of sample units. In the following example, you'll see that the query is calculating a three-hourly summary of trips with multiple aggregate functions:

SELECT pickup_datetime,
       COUNT() total_trips,
       SUM(passenger_count) total_passengers,
       ROUND(AVG(trip_distance), 2) avg_trip_distance,
       ROUND(SUM(fare_amount)) total_fare_amount,
       ROUND(SUM(tip_amount)) total_tip_amount,
       ROUND(SUM(fare_amount + tip_amount)) total_earnings
  FROM 'taxi_trips'
 SAMPLE BY 3h;

You can view the output of the query in the following grid on the QuestDB console:

Weekly summary of trips

As mentioned earlier in the tutorial, although there's no sample unit for a week, a fortnight, or a year, you can derive them simply by utilizing the built-in sample units. If you want to sample the data by a week, use 7d as the sampling time, as shown in the query below:

SELECT pickup_datetime,
       COUNT() total_trips,
       SUM(passenger_count) total_passengers,
       ROUND(AVG(trip_distance), 2) avg_trip_distance,
       ROUND(SUM(fare_amount)) total_fare_amount,
       ROUND(SUM(tip_amount)) total_tip_amount,
       ROUND(SUM(fare_amount + tip_amount)) total_earnings
  FROM 'taxi_trips'
 WHERE pickup_datetime BETWEEN '2018-02-01' AND '2018-02-28'
 SAMPLE BY 7d;

Dealing with missing data

If you've worked a fair bit with data, you already know that data isn't always in a pristine state. One of the most common issues, especially with time-series data, is discontinuity, i.e., scenarios where data is missing for specific time periods. You can quickly identify and deal with missing data using the advanced functionality of the SAMPLE BY extension.

QuestDB offers an easy way to generate and fill missing data with the SAMPLE BY clause. Take the following example: I've deliberately removed data from 4 am to 5 am for the 1st of February 2018. Notice how the FILL keyword, when used in conjunction with the SAMPLE BY extension, can generate a row for the hour starting at 4 am and fill it with some data:

SELECT pickup_datetime,
       COUNT() total_trips,
       SUM(passenger_count) total_passengers,
       ROUND(AVG(trip_distance), 2) avg_trip_distance,
       ROUND(SUM(fare_amount)) total_fare_amount,
       ROUND(SUM(tip_amount)) total_tip_amount,
       ROUND(SUM(fare_amount + tip_amount)) total_earnings
  FROM 'taxi_trips'
 WHERE pickup_datetime NOT BETWEEN '2018-02-01T04:00:00' AND '2018-02-01T04:59:59'
 SAMPLE BY 1h FILL(LINEAR);

In the example above, we've used an inline WHERE clause to emulate missing data with the help of the NOT BETWEEN keyword. Alternatively, you can create a separate table with the missing trips using the same idea, as shown below:

CREATE TABLE 'taxi_trips_missing' AS (
  SELECT *
    FROM 'taxi_trips'
   WHERE pickup_datetime
     NOT BETWEEN '2018-02-01T04:00:00'
     AND '2018-02-01T04:59:59'
);

Ideally, you would use DROP PARTITION to emulate missing data, but because the table is partitioned by MONTH, the smallest unit you can drop is a whole month, so the following query cannot remove just a single hour:

ALTER TABLE 'taxi_trips'
 DROP PARTITION
WHERE pickup_datetime < ('2018-02-01T04:59:59') 
  AND pickup_datetime > ('2018-02-01T04:00:00');

The FILL keyword requires one of the following fillOptions:

- NONE: leaves missing data as is, generating no extra rows; this is the default fillOption.
- NULL: generates rows for missing time periods but leaves all their values as NULL.
- PREV: copies the values of the previous row from the summarized data; useful when you expect the numbers to be similar to the preceding time period.
- LINEAR: interpolates missing values linearly between the immediately preceding and following rows; for a single missing row, this works out to their average.
- CONST or x: hardcodes values where data is missing, with one value per aggregate column, as in FILL (column_1, column_2, column_3, ...).

Here's another example of hardcoding values using the FILL(x) fillOption:
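As a sketch of such a query (assuming the taxi_trips_missing table created earlier), FILL takes one constant per aggregate column, applied positionally:

```sql
SELECT pickup_datetime,
       COUNT() total_trips,
       SUM(passenger_count) total_passengers
  FROM 'taxi_trips_missing'
 SAMPLE BY 1h FILL(0, 0);
```

The generated row for the 4 am hour would then report 0 trips and 0 passengers instead of being absent from the results.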

Working with timezones and offsets

The SAMPLE BY extension also enables you to change timezones and add or subtract offsets from your timestamp columns, adjusting for issues you might encounter when dealing with different source systems, especially ones in other geographic areas. It is important to note that, by default, QuestDB aligns its sample calculation to the first observation, i.e., the earliest timestamp in the result set, as shown in the example below:

SELECT pickup_datetime,
       COUNT() total_trips,
       SUM(passenger_count) total_passengers,
       ROUND(AVG(trip_distance), 2) avg_trip_distance,
       ROUND(SUM(fare_amount)) total_fare_amount,
       ROUND(SUM(tip_amount)) total_tip_amount,
       ROUND(SUM(fare_amount + tip_amount)) total_earnings
  FROM 'taxi_trips'
 WHERE pickup_datetime BETWEEN '2018-02-01T13:35:52' AND '2018-02-28'
 SAMPLE BY 1d;

Note how the 1d sample calculation starts at 13:35:52 and ends at 13:35:51 the next day. Apart from the default demonstrated above, there are two other ways to align your sample calculations: to the calendar time zone, and to the calendar with an offset.
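You can also spell out this default alignment explicitly with the ALIGN TO FIRST OBSERVATION clause, which is equivalent to omitting it; a minimal sketch against the same table:

```sql
SELECT pickup_datetime,
       COUNT() total_trips
  FROM 'taxi_trips'
 WHERE pickup_datetime BETWEEN '2018-02-01T13:35:52' AND '2018-02-28'
 SAMPLE BY 1d
 ALIGN TO FIRST OBSERVATION;
```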

Let's look at the other two alignment methods now.

Aligning sample calculation to another timezone

When moving data from one system to another or via a complex pipeline, you can encounter issues with time zones. For the sake of demonstration, let's assume that you've identified that the data set you've loaded into the database is not for New York City but for Melbourne, Australia. These two cities are far apart and are in very different time zones.

QuestDB allows you to fix this issue by aligning your data to another timezone using the ALIGN TO CALENDAR TIME ZONE option with the SAMPLE BY extension. In the example shown below, ALIGN TO CALENDAR TIME ZONE ('AEST') aligns the sample buckets on pickup_datetime, i.e., the designated timestamp column, to the AEST timezone used in Melbourne.

SELECT pickup_datetime,
       COUNT() total_trips,
       SUM(passenger_count) total_passengers,
       ROUND(AVG(trip_distance), 2) avg_trip_distance,
       ROUND(SUM(fare_amount)) total_fare_amount,
       ROUND(SUM(tip_amount)) total_tip_amount,
       ROUND(SUM(fare_amount + tip_amount)) total_earnings
  FROM 'taxi_trips'
 SAMPLE BY 3h
 ALIGN TO CALENDAR TIME ZONE ('AEST');


Aligning sample calculation with offsets

Similar to the previous example, you can also align your sample calculation by manually offsetting the designated timestamp column by any hh:mm value between -23:59 and 23:59. In the following example, we're offsetting the sample calculation by '-05:30', i.e., negative five hours and thirty minutes:

SELECT pickup_datetime,
       COUNT() total_trips,
       SUM(passenger_count) total_passengers,
       ROUND(AVG(trip_distance), 2) avg_trip_distance,
       ROUND(SUM(fare_amount)) total_fare_amount,
       ROUND(SUM(tip_amount)) total_tip_amount,
       ROUND(SUM(fare_amount + tip_amount)) total_earnings
  FROM 'taxi_trips'
 SAMPLE BY 3h
 ALIGN TO CALENDAR WITH OFFSET '-05:30';

Conclusion

In this tutorial, you learned how to exploit the SAMPLE BY extension in QuestDB to work efficiently with time-series data, especially in aggregated form. In addition, the SAMPLE BY extension also allows you to fix specific common problems with time-series data attributable to complex data pipelines, disparate source systems in different geographical areas, software bugs, etc. All in all, SQL extensions like SAMPLE BY provide a significant advantage when working with time-series data by enabling you to achieve more in fewer lines of SQL.
