Yitaek Hwang

Posted on • Originally published at questdb.io

Data Lifecycle with QuestDB

For most applications dealing with time series data, the value of each data point diminishes over time, as fine-grained detail loses relevance once the data gets stale. For example, when applying a real-time anomaly detection model, more granular data (e.g., data collected at one-second resolution) would yield better results. However, to train forecasting models afterwards, recording data at such high frequency may not be needed and would be costly in terms of storage and compute.

When I was working for an IoT company, we combated this issue by storing data in three separate databases. To show the most up-to-date values, the latest updates were pushed to a NoSQL realtime database. Simultaneously, all the data was appended both to a time series database storing up to three months of data for quick analysis and to an OLAP database for long-term storage. To stop the time series database from exploding in size, we also ran a nightly job to delete old data. As the data grew exponentially with the number of IoT devices, this design caused operational headaches in maintaining three different databases.

QuestDB solves this by providing easy ways to downsample the data and also detach or drop partitions when old data is no longer necessary. This helps to keep all the data in a single database for most operations and move stale data to cheaper storage in line with a mature data retention policy.

To illustrate, let’s revisit the IoT application involving heart rate data. Unfortunately, Google decided to shut down its Cloud IoT Core service, so we’ll use randomized data for this demo.

Populating heart rate data

Let’s begin by running QuestDB via Docker:

docker run -p 9000:9000 \
  -p 9009:9009 \
  -p 8812:8812 \
  -p 9003:9003 \
  -v "$(pwd):/var/lib/questdb" \
  questdb/questdb

We’ll create a simple heart-rate table with an ID, timestamp, heart rate reading, and sensor ID, partitioned by month, via the web console at localhost:9000:

CREATE TABLE heart_rate AS(
  SELECT
    x ID,
    timestamp_sequence(
      to_timestamp('2022-10-10T00:00:00', 'yyyy-MM-ddTHH:mm:ss'),
      rnd_long(1, 10, 0) * 100000L
    ) ts,
    rnd_double(0) * 100 + 60 heartrate,
    rnd_long(0, 10000, 0) sensorId
  FROM
    long_sequence(10000000) x
) TIMESTAMP(ts) PARTITION BY MONTH;

We now have randomized data from 10,000 sensors over a roughly two-month time frame (10M data points). Suppose we are continuously appending to this dataset from a data stream; such frequent updates are useful for detecting anomalies in heart rate, letting us alert on health issues as they arise.
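As a quick illustration, a spot check for out-of-range readings might look like the sketch below; the 40/150 bpm thresholds and the LIMIT are arbitrary values chosen only for this demo:

SELECT ts, sensorId, heartrate
FROM heart_rate
WHERE heartrate < 40 OR heartrate > 150  -- illustrative thresholds, not clinical ones
LIMIT 20;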

Downsampling the data

However, if no anomalies are detected, a dataset with heart rate collected every second is not that useful when we simply want to note general trends over time. Instead, we can record aggregates over one-hour intervals to compact the data. For example, if we’re interested in the min, max, and avg heart rate of a specific sensor, sampled every hour, we can invoke:

SELECT
  min(heartrate),
  max(heartrate),
  avg(heartrate),
  ts
FROM
  heart_rate
WHERE
  sensorId = 1000
SAMPLE BY 1h FILL(NULL, NULL, PREV);

Once we are happy with the downsampled results, we can store them in a separate sampled_data table for the data science team to create forecasting models or do further analysis:

CREATE TABLE sampled_data (ts timestamp, min_heartrate double, max_heartrate double, avg_heartrate double, sensorId long) timestamp(ts);

INSERT INTO sampled_data (ts, min_heartrate, max_heartrate, avg_heartrate, sensorId)
SELECT ts, min(heartrate), max(heartrate), avg(heartrate), sensorId
FROM heart_rate
SAMPLE BY 1h FILL(NULL, NULL, PREV);

This downsampling operation can be done periodically (e.g., daily, monthly) to populate the new table. This way the data science team does not have to import the massive raw dataset and can simply work with the sampled data at an appropriate resolution.
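For example, a nightly job (scheduled outside QuestDB, e.g., with cron) might append only the previous day’s aggregates; the one-day window below is purely illustrative:

-- Append yesterday's hourly aggregates to the sampled table
INSERT INTO sampled_data
SELECT ts, min(heartrate), max(heartrate), avg(heartrate), sensorId
FROM heart_rate
WHERE ts >= dateadd('d', -1, now())
SAMPLE BY 1h FILL(NULL, NULL, PREV);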

Data retention strategy

Downsampling alone, however, does not solve the problem of growing data size: the raw heart_rate table will continue to grow. Here, QuestDB gives us the option to detach or even drop partitions.

Since we partitioned the original dataset by month, we have three partitions: 2022-10, 2022-11, and 2022-12. These can be seen as directories under /db/heart_rate/, alongside other files holding metadata.

/db/heart_rate
├── 2022-10
├── 2022-11
├── 2022-12
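The same monthly buckets can also be inspected from SQL. A per-month row count is a rough sketch, but handy for gauging how much raw data each partition holds before detaching or dropping anything:

-- Rows per monthly bucket, matching the partitions above
SELECT ts, count()
FROM heart_rate
SAMPLE BY 1M;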

After we have downsampled the data, we probably no longer need the raw data from older months. In that case, we can DETACH a partition to make it unavailable for reads:

ALTER TABLE 'heart_rate' DETACH PARTITION LIST '2022-10';

Now the 2022-10 partition is renamed to 2022-10.detached, and queries on the heart_rate table return data from 2022-11 onwards:

SELECT * FROM 'heart_rate' LIMIT 10;
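As an extra sanity check, the earliest visible timestamp should now fall in November:

SELECT min(ts) FROM heart_rate;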

We can then compress this data and move it to cheaper object storage such as AWS S3 or Google Cloud Storage:

tar czf - '/db/heart_rate/2022-10.detached' | aws s3 cp - s3://my-data-backups/2022-10.tar.gz

If we need to restore this partition for further analysis, we can re-download the tar file and uncompress it into a new directory named 2022-10.attachable inside the table directory (/db/heart_rate/, or wherever the rest of the QuestDB data lives):

mkdir 2022-10.attachable && aws s3 cp s3://my-data-backups/2022-10.tar.gz - | tar xvzf - -C 2022-10.attachable --strip-components 1

With the data in place, simply use the ATTACH command:

ALTER TABLE heart_rate ATTACH PARTITION LIST '2022-10';

We can verify that the partition has been re-attached by running a count query and seeing 10M records again:

SELECT count() FROM heart_rate;

Alternatively, if we want to simply delete partitions, we can use the DROP command to do so. Unlike the DETACH command, this operation is irreversible:

ALTER TABLE heart_rate DROP PARTITION LIST '2022-10';




Simplifying operations

As the volume of data continues to explode, it’s important to consider a data retention strategy that is both cost-effective and useful to the teams depending on that data. Time series data, by its nature, lends itself well to aggregation and partitioning by time. We can leverage this quality to serve raw data to teams requiring real-time decisions and then move to a downsampled dataset for other analytic needs. Finally, to control the cost and performance of the time series database, we can detach and store partitions in cheaper, long-term storage options.

QuestDB makes these operations simple with built-in support so that teams don’t have to build custom data pipelines to manually delete and replicate data into different databases. The use case shown here with heart rate data can easily be applied to other industries with high-frequency, time series data (e.g., financial markets, infrastructure monitoring).
