DEV Community

kelvinsteve for Timescale

Posted on • Originally published at tsdb.co

How We Made Data Aggregation Better and Faster on PostgreSQL With TimescaleDB 2.7

Time-series data is the lifeblood of the analytics revolution in nearly every industry today. One of the most difficult challenges for application developers and data scientists is aggregating data efficiently without always having to query billions (or trillions) of raw data rows. Over the years, developers and databases have created numerous ways to solve this problem, usually similar to one of the following options:

DIY processes to pre-aggregate data and store it in regular tables. Although this provides a lot of flexibility, particularly with indexing and data retention, it's cumbersome to develop and maintain, particularly deciding how to track and update aggregates with data that arrives late or has been updated in the past.

Extract Transform and Load (ETL) process for longer-term analytics. Even today, development teams employ entire groups that specifically manage ETL processes for databases and applications because of the constant overhead of creating and maintaining the perfect process.

Materialized Views. While these VIEWS are flexible and easy to create, they are static snapshots of the aggregated data. Unfortunately, developers need to manage updates using TRIGGERs or CRON-like applications in all current implementations. And in all but a very few databases, all historical data is replaced each time, preventing developers from dropping older raw data to save space and computation resources every time the data is refreshed.

Most developers head down one of these paths because we learn, often the hard way, that running reports and analytic queries over the same raw data, request after request, doesn't perform well under heavy load. In truth, most raw time-series data doesn't change after it's been saved, so these complex aggregate calculations return the same results each time.

In fact, as a long-term time-series database developer, I've used all of these methods too, so that I could manage historical aggregate data to make reporting, dashboards, and analytics faster and more valuable, even under heavy usage.

I loved when customers were happy, even if it meant a significant amount of work behind the scenes maintaining that data.

But, I always wished for a more straightforward solution.

**

How TimescaleDB Improves Queries on Aggregated Data in PostgreSQL

**

In 2019, TimescaleDB introduced continuous aggregates to solve this very problem, making the ongoing aggregation of massive time-series data easy and flexible. This is the feature that first caught my attention as a PostgreSQL developer looking to build more scalable time-series applications—precisely because I had been doing it the hard way for so long.

Continuous aggregates look and act like materialized views in PostgreSQL, but with many of the additional features I was looking for. These are just some of the things they do:

  • Automatically track changes and additions to the underlying raw data.
  • Provide configurable, user-defined policies to keep the materialized data up-to-date automatically.
  • Automatically append new data (as real-time aggregates by default) before the scheduled process has materialized to disk. This setting is configurable.
  • Retain historical aggregated data even if the underlying raw data is dropped.
  • Can be compressed to reduce storage needs and further improve the performance of analytic queries.
  • Keep dashboards and reports running smoothly.

Image description

**

What About Other Databases?

**

By now, some readers might be thinking something along these lines:

“Continuous aggregates may help with the management and analytics of time-series data in PostgreSQL, but that’s what NoSQL databases are for—they already provide the features you needed from the get-go. Why didn’t you try a NoSQL database?”

Well, I did.

There are numerous time-series and NoSQL databases on the market that attempt to solve this specific problem. I looked at (and used) many of them. But from my experience, nothing can quite match the advantages of a relational database with a feature like continuous aggregates for time-series data. These other options provide a lot of features for a myriad of use cases, but they weren't the right solution for this particular problem, among other things.

What about MongoDB?
MongoDB has been the go-to for many data-intensive applications. Included since version 4.2 is a feature called On-Demand Materialized Views. On the surface, it works similar to a materialized view by combining the Aggregation Pipeline feature with a $merge operation to mimic ongoing updates to an aggregate data collection. However, there is no built-in automation for this process, and MongoDB doesn't keep track of any modifications to underlying data. The developer is still required to keep track of which time frames to materialize and how far back to look.

What about InfluxDB?
For many years InfluxDB has been the destination for time-series applications. Although we've discussed in other articles how InfluxDB doesn't scale effectively, particularly with high cardinality datasets, it does provide a feature called Continuous Queries. This feature is also similar to a materialized view and goes one step further than MongoDB by automatically keeping the dataset updated. Unfortunately, it suffers from the same lack of raw data monitoring and doesn't provide nearly as much flexibility as SQL in how the datasets are created and stored.

What about Clickhouse?
Clickhouse, and several recent forks like Firebolt, have redefined the way some analytic workloads perform. Even with some of the impressive query performance, it provides a mechanism similar to a materialized view as well, backed by an AggregationMergeTree engine. In a sense, this provides almost real-time aggregated data because all inserts are saved to both the regular table and the materialized view. The biggest downside of this approach is dealing with updates or modifying the timing of the process.

Recent Improvements in Continuous Aggregates: Meet TimescaleDB 2.7

Continuous aggregates were first introduced in TimescaleDB 1.3 solving the problems that many PostgreSQL users, including me, faced with time-series data and materialized views: automatic updates, real-time results, easy data management, and the option of using the view for downsampling.

But continuous aggregates have come a long way. One of the previous improvements was the introduction of compression for continuous aggregates in TimescaleDB 2.6. Now, we took it a step further with the arrival of TimescaleDB 2.7, which introduces dramatic performance improvements in continuous aggregates. They are now blazing fast—up to 44,000x faster in some queries than in previous versions.

Let me give you one concrete example: in initial testing using live, real-time stock trade transaction data, typical candlestick aggregates were nearly 2,800x faster to query than in previous versions of continuous aggregates (which were already fast!)

Later in this post, we will dig into the performance and storage improvements introduced by TimescaleDB 2.7 by presenting a complete benchmark of continuous aggregates using multiple datasets and queries. 🔥

But the improvements don’t end here.

First, the new continuous aggregates also require 60 % less storage (on average) than before for many common aggregates, which directly translates into storage savings. Second, in previous versions of TimescaleDB, continuous aggregates came with certain limitations: users, for example, could not use certain functions like DISTINCT, FILTER, or ORDER BY. These limitations are now gone. TimescaleDB 2.7 ships with a completely redesigned materialization process that solves many of the previous usability issues, so you can use any aggregate function to define your continuous aggregate. Check out our release notes for all the details on what's new.

And now, the fun part.

Show Me the Numbers: Benchmarking Aggregate Queries

To test the new version of continuous aggregates, we chose two datasets that represent common time-series datasets: IoT and financial analysis.

IoT dataset (~1.7 billion rows): The IoT data we leveraged is the New York City Taxicab dataset that's been maintained by Todd Schneider for a number of years, and scripts are available in his GitHub repository to load data into PostgreSQL. Unfortunately, a week after his latest update, the transit authority that maintains the actual datasets changed their long-standing export data format from CSV to Parquet—which means the current scripts will not work. Therefore, the dataset we tested with is from data prior to that change and covers ride information from 2014 to 2021.

Stock transactions dataset (~23.7 million rows): The financial dataset we used is a real-time stock trade dataset provided by Twelve Data and ingests ongoing transactions for the top 100 stocks by volume from February 2022 until now. Real-time transaction data is typically the source of many stock trading analysis applications requiring aggregate rollups over intervals for visualizations like candlestick charts and machine learning analysis. While our example dataset is smaller than a full-fledged financial application would maintain, it provides a working example of ongoing data ingestion using continuous aggregates, TimescaleDB native compression, and automated raw data retention (while keeping aggregate data for long-term analysis).
You can use a sample of this data, generously provided by Twelve Data, to try all of the improvements in TimescaleDB 2.7 by following this tutorial, which provides stock trade data for the last 30 days. Once you have the database setup, you can take it a step further by registering for an API key and following our tutorial to ingest ongoing transactions from the Twelve Data API.

Creating Continuous Aggregates Using Standard PostgreSQL Aggregate Functions

The first thing we benchmarked was to create an aggregate query that used standard PostgreSQL aggregate functions like MIN(), MAX(), and AVG(). In each dataset we tested, we created the same continuous aggregate in TimescaleDB 2.6.1 and 2.7, ensuring that both aggregates had computed and stored the same number of rows.

IoT dataset

This continuous aggregate resulted in 1,760,000 rows of aggregated data spanning seven years of data.

CREATE MATERIALIZED VIEW hourly_trip_stats
WITH (timescaledb.continuous, timescaledb.finalized=false)
AS
SELECT
time_bucket('1 hour',pickup_datetime) bucket,
avg(fare_amount) avg_fare,
min(fare_amount) min_fare,
max(fare_amount) max_fare,
avg(trip_distance) avg_distance,
min(trip_distance) min_distance,
max(trip_distance) max_distance,
avg(congestion_surcharge) avg_surcharge,
min(congestion_surcharge) min_surcharge,
max(congestion_surcharge) max_surcharge,
cab_type_id,
passenger_count
FROM
trips
GROUP BY
bucket, cab_type_id, passenger_count

Stock transactions dataset

This continuous aggregate resulted in 950,000 rows of data at the time of testing, although these are updated as new data comes in.

CREATE MATERIALIZED VIEW five_minute_candle_delta
WITH (timescaledb.continuous) AS
SELECT
time_bucket('5 minute', time) AS bucket,
symbol,
FIRST(price, time) AS "open",
MAX(price) AS high,
MIN(price) AS low,
LAST(price, time) AS "close",
MAX(day_volume) AS day_volume,
(LAST(price, time)-FIRST(price, time))/FIRST(price, time) AS change_pct
FROM stocks_real_time srt
GROUP BY bucket, symbol;

To test the performance of these two continuous aggregates, we selected the following queries, all common queries among our users for both the IoT and financial use cases:

  • SELECT COUNT (*)
  • SELECT COUNT (*) with WHERE
  • ORDER BY
  • time_bucket reaggregation
  • FILTER
  • HAVING

Let’s take a look at the results.

Query #1: SELECT COUNT(*) FROM…

Doing a COUNT() from PostgreSQL is a known performance bottleneck. It's one of the reasons we created the approximate_row_count() function in TimescaleDB which uses table statistics to provide a close approximation of the overall row count. However, it's instinctual for most users (and ourselves, if we're honest) to try and get a quick row count by doing a COUNT() query:

-- IoT dataset
SELECT count(*) FROM hourly_trip_stats;

-- Stock transactions dataset
SELECT count(*) FROM five_min_candle_delta;

And most users recognized that in previous versions of TimescaleDB, the materialized data seemed slower than normal to do a COUNT over.

Thinking about our two example datasets, both continuous aggregates reduce the overall row count from raw data by 20x or more. So, while counting rows in PostgreSQL is slow, it always felt a little slower than it had to be. The reason was that not only did PostgreSQL have to scan and count all of the rows of data, it had to group the data a second time because of some additional data that TimescaleDB stored as part of the original design of continuous aggregates. With the new design of continuous aggregates in TimescaleDB 2.7, that second grouping is no longer required, and PostgreSQL can just query the data normally, translating into faster queries.

Image description

Performance of a query with SELECT COUNT (*) in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7

Query #2: SELECT COUNT(*) Based on The Value of a Column

Another common query that many analytic applications perform is to count the number of records where the aggregate value is within a certain range:

-- IoT dataset
SELECT count(*) FROM hourly_trip_stats
WHERE avg_fare > 13.1
AND bucket > '2018-01-01' AND bucket < '2019-01-01';

-- Stock transactions dataset
SELECT count(*) FROM five_min_candle_delta
WHERE change_pct > 0.02;

In previous versions of continuous aggregates, TimescaleDB had to finalize the value before it could be filtered against the predicate value, which caused queries to perform more slowly. With the new version of continuous aggregates, PostgreSQL can now search for the value directly, and we can add an index to meaningful columns to speed up the query even more!

In the case of the financial dataset, we see a very significant improvement: 1,336x faster. The large change in performance can be attributed to the formula query that has to be calculated over all of the rows of data in the continuous aggregate. With the IoT dataset, we're comparing against a simple average function, but for the stock data, multiple values have to be finalized (FIRST/LAST) before the formula can be calculated and used for the filter.

Image description

Performance of a query with SELECT COUNT (*) plus WHERE in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7

Query #3: Select Top 10 Rows by Value

Taking the first example a step further, it's very common to query data within a range of time and get the top rows:

-- IoT dataset
SELECT * FROM hourly_trip_stats
ORDER BY avg_fare desc
LIMIT 10;

-- Stock transactions dataset
SELECT * FROM five_min_candle_delta
ORDER BY change_pct DESC
LIMIT 10;

In this case, we tested queries with the continuous aggregate set to provide real-time results (the default for continuous aggregates) and materialized-only results. When set to real-time, TimescaleDB always queries data that's been materialized first and then appends (with a UNION) any newer data that exists in the raw data but that has not yet been materialized by the ongoing refresh policy. And, because it's now possible to index columns within the continuous aggregate, we added an index on the ORDER BY column.

Image description

Performance of a query with ORDER BY in a continuous aggregate TimescaleDB 2.6.1 and TimescaleDB 2.7

Yes, you read that correctly. Nearly 45,000x better performance on ORDER BY when the query only searches through materialized data.

The dramatic difference between real-time and materialized-only queries is because of the UNION of both materialized and raw aggregate data. The PostgreSQL planner needs to union the total result before it can limit the query to 10 rows (in our example), and so all of the data from both tables need to be read and ordered first. When you only query materialized data, PostgreSQL and TimescaleDB knows that it can query just the index of the materialized data.

Again, storing the finalized form of your data and indexing column values dramatically impacts the querying performance of historical aggregate data! And all of this is updated continuously over time in a non-destructive way—something that's impossible to do with any other relational database, including vanilla PostgreSQL.

Query #4: Timescale Hyperfunctions to Re-aggregate Into Higher Time Buckets

Another example we wanted to test was the impact finalizing data values has on our suite of analytical hyperfunctions. Many of the hyperfunctions we provide as part of the TimescaleDB Toolkit utilize custom aggregate values that allow many different values to be accessed later depending on the needs of an application or report. Furthermore, these aggregate values can be re-aggregated into different size time buckets. This means that if the aggregate functions fit your use case, one continuous aggregate can produce results for many different time_bucket sizes! This is a feature many users have asked for over time, and hyperfunctions make this possible.

For this example, we only examined the New York City Taxicab dataset to benchmark the impact of finalized CAGGs. Currently, there is not an aggregate hyperfunction that aligns with the OHLC values needed for the stock data set, however, there is a feature request for it! (😉)

Although there are not currently any one-to-one hyperfunctions that provide exact replacements for our min/max/avg example, we can still observe the query improvement using a tdigest value for each of the columns in our original query.

Original min/max/avg continuous aggregate for multiple columns:

CREATE MATERIALIZED VIEW hourly_trip_stats
WITH (timescaledb.continuous, timescaledb.finalized=false)
AS
SELECT
time_bucket('1 hour',pickup_datetime) bucket,
avg(fare_amount) avg_fare,
min(fare_amount) min_fare,
max(fare_amount) max_fare,
avg(trip_distance) avg_distance,
min(trip_distance) min_distance,
max(trip_distance) max_distance,
avg(congestion_surcharge) avg_surcharge,
min(congestion_surcharge) min_surcharge,
max(congestion_surcharge) max_surcharge,
cab_type_id,
passenger_count
FROM
trips
GROUP BY
bucket, cab_type_id, passenger_count

Hyperfunction-based continuous aggregate for multiple columns:

CREATE MATERIALIZED VIEW hourly_trip_stats_toolkit
WITH (timescaledb.continuous, timescaledb.finalized=false)
AS
SELECT
time_bucket('1 hour',pickup_datetime) bucket,
tdigest(1,fare_amount) fare_digest,
tdigest(1,trip_distance) distance_digest,
tdigest(1,congestion_surcharge) surcharge_digest,
cab_type_id,
passenger_count
FROM
trips
GROUP BY
bucket, cab_type_id, passenger_count

With the continuous aggregate created, we then queried this data in two different ways:

*1. Using the same time_bucket() size defined in the continuous aggregate, which in this example was one-hour data.
*

SELECT
bucket AS b,
cab_type_id,
passenger_count,
min_val(ROLLUP(fare_digest)),
max_val(ROLLUP(fare_digest)),
mean(ROLLUP(fare_digest))
FROM hourly_trip_stats_toolkit
WHERE bucket > '2021-05-01' AND bucket < '2021-06-01'
GROUP BY b, cab_type_id, passenger_count
ORDER BY b DESC, cab_type_id, passenger_count;

Image description

_Performance of a query with time_bucket() in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7 (the query uses the same bucket size as the definition of the continuous aggregate) _

**2. We re-aggregated the data from one-hour buckets into one-day buckets. **This allows us to efficiently query different bucket lengths based on the original bucket size of the continuous aggregate.

SELECT
time_bucket('1 day', bucket) AS b,
cab_type_id,
passenger_count,
min_val(ROLLUP(fare_digest)),
max_val(ROLLUP(fare_digest)),
mean(ROLLUP(fare_digest))
FROM hourly_trip_stats_toolkit
WHERE bucket > '2021-05-01' AND bucket < '2021-06-01'
GROUP BY b, cab_type_id, passenger_count
ORDER BY b DESC, cab_type_id, passenger_count;

Image description

Performance of a query with time_bucket() in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7. The query re-aggregates the data from one-hour buckets into one-day buckets

In this case, the speed is almost identical because the same amount of data has to be queried. But if these aggregates satisfy your data requirements, only one continuous aggregate would be necessary in many cases, rather than a different continuous aggregate for each bucket size (one minute, five minutes, one hour, etc.)

Query #5: Pivot Queries With FILTER

In previous versions of continuous aggregates, many common SQL features were not permitted because of how the partial data was stored and finalized later. Using a PostgreSQL FILTER clause was one such restriction.

For example, we took the IoT dataset and created a simple COUNT(*) to calculate each company's number of taxi rides ( cab_type_id) for each hour. Before TimescaleDB 2.7, you would have to store this data in a narrow column format, storing a row in the continuous aggregate for each cab type.

CREATE MATERIALIZED VIEW hourly_ride_counts_by_type
WITH (timescaledb.continuous, timescaledb.finalized=false)
AS
SELECT
time_bucket('1 hour',pickup_datetime) bucket,
cab_type_id,
COUNT(*)
FROM trips
WHERE cab_type_id IN (1,2)
GROUP BY
bucket, cab_type_id;

To then query this data in a pivoted fashion, we could FILTER the continuous aggregate data after the fact.

SELECT bucket,
sum(count) FILTER (WHERE cab_type_id IN (1)) yellow_cab_count,
sum(count) FILTER (WHERE cab_type_id IN (2)) green_cab_count
FROM hourly_ride_counts_by_type
WHERE bucket > '2021-05-01' AND bucket < '2021-06-01'
GROUP BY bucket
ORDER BY bucket;

In TimescaleDB 2.7, you can now store the aggregated data using a FILTER clause to achieve the same result in one step!

CREATE MATERIALIZED VIEW hourly_ride_counts_by_type_new
WITH (timescaledb.continuous)
AS
SELECT
time_bucket('1 hour',pickup_datetime) bucket,
COUNT() FILTER (WHERE cab_type_id IN (1)) yellow_cab_count,
COUNT(
) FILTER (WHERE cab_type_id IN (2)) green_cab_count
FROM trips
GROUP BY
bucket;

Querying this data is much simpler, too, because the data is already pivoted and finalized.

SELECT * FROM hourly_ride_counts_by_type_new
WHERE bucket > '2021-05-01' AND bucket < '2021-06-01'
ORDER BY bucket;

This saves storage (50 % fewer rows in this case) and CPU to finalize the COUNT(*) and then filter the results each time based on cab_type_id. We can see this in the query performance numbers.

Image description

_Performance of a query with FILTER in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7. _

Being able to use FILTER and other SQL features improve both developer experience and flexibility long term!

Query #6: HAVING Stores Significantly Less Materialized Data

As a final example of how the improvements to continuous aggregates will impact your day-to-day development and analytics processes, let's look at a simple query that uses a HAVING clause to reduce the number of rows that the aggregate stores.

In previous versions of TimescaleDB, the having clause couldn't be applied at materialization time. Instead, the HAVING clause was applied after the fact to all of the aggregated data as it was finalized. In many cases, this dramatically affected both the speed of queries to the continuous aggregate and the amount of data stored overall.

Using our stock data as an example, let's create a continuous aggregate that only stores a row of data if the change_pct value is greater than 20 %. This would indicate that a stock price changed dramatically over one hour, something we don't expect to see in most hourly stock trades.

CREATE MATERIALIZED VIEW one_hour_outliers
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', time) AS bucket,
symbol,
FIRST(price, time) AS "open",
MAX(price) AS high,
MIN(price) AS low,
LAST(price, time) AS "close",
MAX(day_volume) AS day_volume,
(LAST(price, time)-FIRST(price, time))/LAST(price, time) AS change_pct
FROM stocks_real_time srt
GROUP BY bucket, symbol
HAVING (LAST(price, time)-FIRST(price, time))/LAST(price, time) > .02;

Once the dataset is created, we can query each aggregate to see how many rows matched our criteria.

SELECT count(*) FROM one_hour_outliers;

Image description

Table comparing the performance of a query with HAVING in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7.

The biggest difference here (and the one that will more negatively impact the performance of your application over time) is the storage size of this aggregated data. Because TimescaleDB 2.7 only stores rows that meet the criteria, the data footprint is significantly smaller!

Image description

Storage footprint of a continuous aggregate bucketing stock transactions by the hour in TimescaleDB 2.6.1 and TimescaleDB 2.7

Storage Savings in TimescaleDB 2.7

One of the final pieces of this update that excites us is how much storage will be saved over time. On many occasions, users with large datasets that contained complex equations in their continuous aggregates would join our Slack community to ask why more storage is required for the rolled-up aggregate than the raw data.

In every case we've tested, the new, finalized form of continuous aggregates is smaller than the same example in previous versions of TimescaleDB, with or without a HAVING clause that might filter additional data out.

Image description

Storage savings for different continuous aggregates in TimescaleDB 2.6.1 and TimescaleDB 2.7

The New Continuous Aggregates Are a Game-Changer

For those dealing with massive amounts of time-series data, continuous aggregates are the best way to solve a problem that has long haunted PostgreSQL users. The following list details how continuous aggregates expand materialized views:

  • They always stay up-to-date, automatically tracking changes in the source table for targeted, efficient updates of materialized data.
  • You can use configurable policies to conveniently manage refresh/update interval.
  • You can keep your materialized data even after the raw data is dropped, allowing you to downsample your large datasets.
  • And you can compress older data to save space and improve analytic queries.
  • And in TimescaleDB 2.7, continuous aggregates got much better.

First, they are blazing fast: as we demonstrated with our benchmark, the performance of continuous aggregates got consistently better across queries and datasets, up to thousands of times better for common queries. They also got lighter, requiring an average of 60 % less storage.

But besides the performance improvements and storage savings, there are significantly fewer limitations on the types of aggregate queries you can use with continuous aggregates, such as:

  • Aggregates with DISTINCT
  • Aggregates with FILTER
  • Aggregates with FILTER in HAVING clause
  • Aggregates without combine function
  • Ordered-set aggregates
  • Hypothetical-set aggregates

This new version of continuous aggregates is available by default in TimescaleDB 2.7: now, when you create a new continuous aggregate, you will automatically benefit from all the latest changes. For your existing continuous aggregates, we recommend that you recreate them in the latest version to take advantage of all these improvements. Read our release notes for more information on TimescaleDB 2.7, and for instructions on how to upgrade, check out our docs.

Top comments (0)