<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: davidkohn88</title>
    <description>The latest articles on DEV Community by davidkohn88 (@davidkohn88).</description>
    <link>https://dev.to/davidkohn88</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F675662%2F97d3c6f2-bb35-40a8-aaa5-c9734b822047.jpg</url>
      <title>DEV Community: davidkohn88</title>
      <link>https://dev.to/davidkohn88</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/davidkohn88"/>
    <language>en</language>
    <item>
      <title>How percentile approximation works (and why it's more useful than averages)</title>
      <dc:creator>davidkohn88</dc:creator>
      <pubDate>Wed, 24 Nov 2021 21:37:13 +0000</pubDate>
      <link>https://dev.to/tigerdata/how-percentile-approximation-works-and-why-its-more-useful-than-averages-4akj</link>
      <guid>https://dev.to/tigerdata/how-percentile-approximation-works-and-why-its-more-useful-than-averages-4akj</guid>
      <description>&lt;h2&gt;Table of contents&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Things I forgot from 7th grade math: percentiles vs. averages&lt;/li&gt;
&lt;li&gt;Long tails, outliers, and real effects: Why percentiles are better than averages for understanding your data&lt;/li&gt;
&lt;li&gt;How percentiles work in PostgreSQL&lt;/li&gt;
&lt;li&gt;Percentile approximation: what it is and why we use it in TimescaleDB hyperfunctions&lt;/li&gt;
&lt;li&gt;Percentile approximation deep dive: approximation methods, how they work, and how to choose&lt;/li&gt;
&lt;li&gt;Wrapping it up&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Get a primer on percentile approximations, why they're useful for analyzing large time-series data sets, and how we created the percentile approximation hyperfunctions to be efficient to compute, parallelizable, and useful with continuous aggregates and other advanced TimescaleDB features.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In my recent post on &lt;a href="https://blog.timescale.com/blog/what-time-weighted-averages-are-and-why-you-should-care/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=percentile-approximation-2021&amp;amp;utm_content=blog-time-weighted-averages" rel="noopener noreferrer"&gt;time-weighted averages&lt;/a&gt;, I described how my early career as an electrochemist exposed me to the importance of time-weighted averages, which shaped how we built them into TimescaleDB hyperfunctions. A few years ago, soon after I started learning more about PostgreSQL internals (check out my &lt;a href="https://blog.timescale.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=percentile-approximation-2021&amp;amp;utm_content=blog-postgreSQL-aggregation-works" rel="noopener noreferrer"&gt;aggregation and two-step aggregates&lt;/a&gt; post to learn about them yourself!), I worked on backends for an ad analytics company, where I started using TimescaleDB.&lt;/p&gt;

&lt;p&gt;Like most companies, we cared a lot about making sure our website and API calls returned results in a reasonable amount of time for the user; we had billions of rows in our analytics databases, but we still wanted to make sure that the website was responsive and useful.&lt;/p&gt;

&lt;p&gt;There’s a direct correlation between website performance and business results: users get bored if they have to wait too long for results, which is obviously not ideal from a business and customer loyalty perspective. To understand how our website performed and find ways to improve, we tracked the timing of our API calls and used API call response time as a key metric.&lt;/p&gt;

&lt;p&gt;Monitoring an API is a common scenario and generally falls under the category of application performance monitoring (APM), but there are lots of similar scenarios in other fields including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictive maintenance for industrial machines&lt;/li&gt;
&lt;li&gt;Fleet monitoring for shipping companies&lt;/li&gt;
&lt;li&gt;Energy and water use monitoring and anomaly detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, analyzing raw (usually time-series) data only gets you so far. You want to analyze trends, understand how your system performs relative to what you and your users expect, and catch and fix issues before they impact production users, and so much more. We &lt;a href="https://blog.timescale.com/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=percentile-approximation-2021&amp;amp;utm_content=blog-introducing-hyperfunctions" rel="noopener noreferrer"&gt;built TimescaleDB hyperfunctions to help solve this problem and simplify how developers work with time-series data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For reference, hyperfunctions are a series of SQL functions that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code. You can use hyperfunctions to calculate percentile approximations of data, compute time-weighted averages, downsample and smooth data, and perform faster &lt;code&gt;COUNT DISTINCT&lt;/code&gt; queries using approximations. Moreover, hyperfunctions are “easy” to use: you call a hyperfunction using the same SQL syntax you know and love.&lt;/p&gt;

&lt;p&gt;We spoke with community members to understand their needs, and our initial release includes some of the most frequently requested functions, including &lt;strong&gt;percentile approximations&lt;/strong&gt; (see &lt;a href="https://github.com/timescale/timescaledb-toolkit/issues/41" rel="noopener noreferrer"&gt;GitHub feature request and discussion&lt;/a&gt;). They’re very useful for working with large time-series data sets because they offer the benefits of using percentiles (rather than averages or other counting statistics) while still being quick and space-efficient to compute, parallelizable, and useful with continuous aggregates and other advanced TimescaleDB features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you’d like to get started with the &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=percentile-approximation-2021&amp;amp;utm_content=docs-percentile-approximation" rel="noopener noreferrer"&gt;percentile approximation hyperfunctions&lt;/a&gt; - and many more - right away, spin up a fully managed TimescaleDB service:&lt;/strong&gt; create an account to &lt;a href="https://console.cloud.timescale.com/signup?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=percentile-approximation-2021&amp;amp;utm_content=cloud-signup" rel="noopener noreferrer"&gt;try it for free&lt;/a&gt; for 30 days. (Hyperfunctions are pre-loaded on each new database service on Timescale Cloud, so after you create a new service, you’re all set to use them).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you prefer to manage your own database instances, you can &lt;a href="https://github.com/timescale/timescaledb-toolkit" rel="noopener noreferrer"&gt;download and install the timescaledb_toolkit extension&lt;/a&gt;&lt;/strong&gt; on GitHub, after which you’ll be able to use percentile approximation and other hyperfunctions.&lt;/p&gt;

&lt;p&gt;Finally, we love building in public and continually improving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you have questions or comments on this blog post, &lt;a href="https://github.com/timescale/timescaledb-toolkit/discussions/185" rel="noopener noreferrer"&gt;we’ve started a discussion on our GitHub page, and we’d love to hear from you&lt;/a&gt;. And, if you like what you see, GitHub ⭐ are always welcome and appreciated too!&lt;/li&gt;
&lt;li&gt;You can view our &lt;a href="https://github.com/timescale/timescaledb-toolkit" rel="noopener noreferrer"&gt;upcoming roadmap on GitHub&lt;/a&gt; for a list of proposed features, as well as features we’re currently implementing and those that are available to use today.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Things I forgot from 7th grade math: percentiles vs. averages&lt;/h2&gt;

&lt;p&gt;I probably learned about averages, medians, and modes in 7th grade math class, but if you’re anything like me, they may periodically get lost in the cloud of “things I learned once and thought I knew, but actually, I don’t remember quite as well as I thought.”&lt;/p&gt;

&lt;p&gt;As I was researching this piece, I found a number of good blog posts (see examples from the folks at &lt;a href="https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/" rel="noopener noreferrer"&gt;Dynatrace&lt;/a&gt;, &lt;a href="https://www.elastic.co/blog/averages-can-dangerous-use-percentile" rel="noopener noreferrer"&gt;Elastic&lt;/a&gt;, &lt;a href="https://blog.appsignal.com/2018/12/04/dont-be-mean-statistical-means-and-percentiles-101.html" rel="noopener noreferrer"&gt;AppSignal&lt;/a&gt;, and &lt;a href="https://www.optimizely.com/insights/blog/why-cdn-balancing/" rel="noopener noreferrer"&gt;Optimizely&lt;/a&gt;) about how averages aren’t great for understanding application performance, or other similar things, and why it’s better to use percentiles.&lt;/p&gt;

&lt;p&gt;I won’t spend too long on this, but I think it’s important to provide a bit of background on why and how percentiles can help us better understand our data.&lt;/p&gt;

&lt;p&gt;First off, let’s consider how percentiles and averages are defined. To understand this, let’s start by looking at a &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Normal_distribution" rel="noopener noreferrer"&gt;normal distribution&lt;/a&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jjnt41d6t1hfxg12eyj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jjnt41d6t1hfxg12eyj.jpg" alt="Trulli" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;b&gt;A normal, or Gaussian, distribution describes many real-world processes that fall around a given value and where the probability of finding values that are further from the center decreases. The median, average, and mode are all the same for a normal distribution, and they fall on the dotted line at the center.&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
The normal distribution is what we often think of when we think about statistics; it’s one of the most frequently used distributions and a staple of introductory courses. In a normal distribution, the median, the average (also known as the mean), and the mode are all the same, even though they’re defined differently.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;median&lt;/strong&gt; is the middle value, where half of the data is above and half is below. The &lt;strong&gt;mean&lt;/strong&gt; (aka average) is defined as sum(value) / count(value), and the &lt;strong&gt;mode&lt;/strong&gt; is the most common or most frequently occurring value.&lt;/p&gt;
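&lt;p&gt;As a quick sanity check of those definitions, here’s a small Python sketch (illustrative only – the sample values are made up) that computes all three statistics for the same data set:&lt;/p&gt;

```python
from statistics import mean, median, mode

# A small, made-up sample of values (e.g., response times in ms)
values = [100, 200, 250, 250, 300, 400, 900]

print(mean(values))    # sum(values) / len(values), about 342.9
print(median(values))  # middle value of the sorted list: 250
print(mode(values))    # most frequently occurring value: 250
```

&lt;p&gt;Notice that the single large value (900) pulls the mean up to roughly 343, while the median and mode stay at 250 – a preview of the long-tail behavior discussed later.&lt;/p&gt;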

&lt;p&gt;When we’re looking at a curve like this, the x-axis represents the value, while the y-axis represents the frequency with which we see a given value (i.e., values that are “higher” on the y-axis occur more frequently).&lt;/p&gt;

&lt;p&gt;In a normal distribution, we see a curve centered (the dotted line) at its most frequent value, with decreasing probability of seeing values further away from the most frequent one (the most frequent value is the mode). Note that the normal distribution is symmetric, which means that values to the left and right of the center have the same probability of occurring.&lt;/p&gt;

&lt;p&gt;The median, or the middle value, is also known as the 50th percentile (the middle percentile out of 100). This is the value at which 50% of the data is less than the value, and 50% is greater than the value (or equal to it).&lt;/p&gt;

&lt;p&gt;In the below graph, half of the data is to the left (shaded in blue) and half is to the right (shaded in yellow), with the 50th percentile directly in the center.&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnp0dhthdeb9ok4ceghe.jpg" alt="A normal distribution with the median/50th percentile depicted." width="800" height="472"&gt;&lt;b&gt;A normal distribution with the median/50th percentile depicted.&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
This leads us to percentiles: a &lt;strong&gt;percentile&lt;/strong&gt; is defined as the value where x percent of the data falls below the value.&lt;/p&gt;

&lt;p&gt;For example, if we call something “the 10th percentile,” we mean that 10% of the data is less than the value and 90% is greater than (or equal to) the value.&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcd4hzxvvpsqy7i28bhf.jpg" alt="A normal distribution with the 10th percentile depicted." width="800" height="472"&gt;&lt;b&gt;A normal distribution with the 10th percentile depicted.&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
And the 90th percentile is where 90% of the data is less than the value and 10% is greater:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8gsclrgosqi0ly9t2km.jpg" alt="A normal distribution with the 90th percentile depicted." width="800" height="472"&gt;&lt;b&gt;A normal distribution with the 90th percentile depicted.&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
To calculate the 10th percentile, let’s say we have 10,000 values. We take all of the values, order them from smallest to largest, and identify the 1001st value (where 1000 or 10% of the values are below it), which will be our 10th percentile.&lt;/p&gt;
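&lt;p&gt;That sort-and-pick procedure can be sketched in a few lines of Python (an illustrative sketch using the simple “discrete” convention described above – always returning an actual data point – not how any particular database implements it):&lt;/p&gt;

```python
def exact_percentile(values, fraction):
    """Return the value such that roughly `fraction` of the data
    falls below it (discrete percentile: always an actual data point)."""
    ordered = sorted(values)
    # With n values, skip the first fraction*n of them and take the next one.
    index = min(int(fraction * len(ordered)), len(ordered) - 1)
    return ordered[index]

# 10,000 values: with fraction=0.1 we pick the 1001st smallest value
# (index 1000), so 1,000 values (10%) fall below it.
data = list(range(1, 10001))  # 1, 2, ..., 10000
print(exact_percentile(data, 0.1))  # 1001
print(exact_percentile(data, 0.5))  # 5001
```

&lt;p&gt;Note that the whole data set has to be sorted before we can pick anything out – a point that becomes important when we talk about how percentiles are computed inside the database.&lt;/p&gt;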

&lt;p&gt;We noted before that the median and average are the same in a normal distribution. This is because a normal distribution is symmetric: the points above the median exactly balance the points below it, both in number and in how far they fall from the center.&lt;/p&gt;

&lt;p&gt;In other words, there is always the same number of points on either side of the median, but the average takes into account the actual value of the points.&lt;/p&gt;

&lt;p&gt;For the median and average to be equal, the points less than the median and greater than the median must have the same distribution (i.e., there must be the same number of points that are somewhat larger and somewhat smaller and much larger and much smaller). (&lt;strong&gt;Correction&lt;/strong&gt;: as pointed out to us in &lt;a href="https://news.ycombinator.com/item?id=28527954" rel="noopener noreferrer"&gt;a helpful comment on Hacker News&lt;/a&gt;, this is only guaranteed for symmetric distributions; for asymmetric distributions it may or may not hold, and you can even get odd cases of asymmetric distributions where the two are equal, though that’s less likely!)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this important?&lt;/strong&gt; The fact that median and average are the same in the normal distribution can cause some confusion. Since a normal distribution is often one of the first things we learn, we (myself included!) can think it applies to more cases than it actually does.&lt;/p&gt;

&lt;p&gt;It’s easy to forget, or fail to realize, that only the median guarantees that 50% of the values will be above it and 50% below – while the average guarantees that 50% of the &lt;strong&gt;weighted&lt;/strong&gt; values will be above and 50% below (i.e., the average is the &lt;a href="https://en.wikipedia.org/wiki/Centroid" rel="noopener noreferrer"&gt;centroid&lt;/a&gt;, while the median is the center).&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjwkdqpkva1ay1d3ai75.jpg" alt="The average and median are the same in a normal distribution, and they split the graph exactly in half. But they aren’t calculated the same way, don’t represent the same thing, and aren’t necessarily the same in other distributions." width="800" height="472"&gt;&lt;b&gt;The average and median are the same in a normal distribution, and they split the graph exactly in half. But they aren’t calculated the same way, don’t represent the same thing, and aren’t necessarily the same in other distributions.&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
🙏 Shout out to the folks over at &lt;a href="https://www.desmos.com/" rel="noopener noreferrer"&gt;Desmos&lt;/a&gt; for their great graphing calculator, which helped make these graphs, and even allowed me to make an &lt;a href="https://www.desmos.com/calculator/ty3jt8ftgs" rel="noopener noreferrer"&gt;interactive demonstration of these concepts&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;But, to get out of the theoretical, let’s consider something more common in the real world, like the API response time scenario from my work at the ad analytics company.&lt;/p&gt;


&lt;h2&gt;Long tails, outliers, and real effects: Why percentiles are better than averages for understanding your data&lt;/h2&gt;

&lt;p&gt;We looked at how averages and percentiles are different – and now, we’re going to use a real-world scenario to demonstrate how using averages instead of percentiles can lead to false alarms or missed opportunities.&lt;/p&gt;

&lt;p&gt;Why? Averages don’t always give you enough information to distinguish between real effects and outliers or noise, whereas percentiles can do a much better job.&lt;/p&gt;

&lt;p&gt;Simply put, using averages can have a dramatic (and negative) impact on how values are reported, while percentiles can help you get closer to the “truth.”&lt;/p&gt;

&lt;p&gt;If you’re looking at something like API response time, you’ll likely see a frequency distribution curve that looks something like this:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal5fkn4usyv5f4424jsx.jpg" alt="A frequency distribution for API response times with a peak at 250ms (all graphs are not to scale and are meant only for demonstration purposes)." width="800" height="492"&gt;&lt;b&gt;A frequency distribution for API response times with a peak at 250ms (all graphs are not to scale and are meant only for demonstration purposes).&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
In my former role at the ad analytics company, we’d aim for most of our API response calls to finish in under half a second, and many were much, much shorter than that. When we monitored our API response times, one of the most important things we tried to understand was how users were affected by changes in the code.  &lt;/p&gt;

&lt;p&gt;Most of our API calls finished in under half a second, but some people used the system to get data over very long time periods or had odd configurations that meant their dashboards were a bit less responsive (though we tried to make sure those were rare!).&lt;/p&gt;

&lt;p&gt;The type of curve that resulted is characterized as a &lt;strong&gt;long-tail distribution&lt;/strong&gt; where we have a relatively large spike at 250 ms, with a lot of our values under that and then an exponentially decreasing number of longer response times.&lt;/p&gt;

&lt;p&gt;We talked earlier about symmetric curves (like the normal distribution); a long-tail distribution, by contrast, is an &lt;strong&gt;asymmetric&lt;/strong&gt; curve.&lt;/p&gt;

&lt;p&gt;This means that the largest values are much larger than the middle values, while the smallest values aren’t that far from the middle values. (In the API monitoring case, you can never have an API call that takes less than 0 seconds to respond, but there’s no limit to how long one can take, so you get that long tail of longer API calls.)&lt;/p&gt;

&lt;p&gt;Thus, the average and the median of a long-tail distribution start to diverge:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmw72g5n4auugwygl5l3.jpg" alt="The API response time frequency curve with the median and average labeled. Graphs are not to scale and are meant for demonstration purposes only." width="800" height="492"&gt;&lt;b&gt;The API response time frequency curve with the median and average labeled. Graphs are not to scale and are meant for demonstration purposes only.&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
In this scenario, the average is significantly larger than the median because there are enough “large” values in the long tail to make the average larger. Conversely, in some other cases, the average might be smaller than the median.&lt;/p&gt;

&lt;p&gt;But at the ad analytics company, we found that the average didn’t give us enough information to distinguish between important changes in how our API responded to software changes vs. noise/outliers that only affected a few individuals.&lt;/p&gt;

&lt;p&gt;In one case, we introduced a change to the code that had a new query. The query worked fine in staging, but there was a lot more data in the production system.&lt;/p&gt;

&lt;p&gt;Once the data was “warm” (in memory), it would run quickly, but it was very slow the first time. When the query went into production, the response time was well over a second for ~10% of the calls.&lt;/p&gt;

&lt;p&gt;In our frequency curve, a response time over a second (but less than 10s) for ~10% of the calls resulted in a second, smaller hump in our frequency curve and looked like this:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kvl89sl8btsmxfklqyb.jpg" alt="A frequency curve showing the shift and extra hump that occurs when 10% of calls take a moderate amount of time, between 1 and 10s (graph still not to scale)." width="800" height="492"&gt;&lt;b&gt;A frequency curve showing the shift and extra hump that occurs when 10% of calls take a moderate amount of time, between 1 and 10s (graph still not to scale).&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
In this scenario, the average shifted a lot, while the median shifted only slightly – it’s much less impacted.&lt;/p&gt;

&lt;p&gt;You might think that this makes the average a better metric than the median because it helped us identify the problem (overly long API response times), and we could set up our alerting to notify us when the average shifts.&lt;/p&gt;

&lt;p&gt;Let’s imagine that we’ve done that, and people will jump into action when the average goes above, say, 1 second.&lt;/p&gt;

&lt;p&gt;But now, we get a few users who start requesting 15 years of data from our UI...and those API calls take a really long time. This is because the API wasn’t really built to handle this “off-label” use.&lt;/p&gt;

&lt;p&gt;Just a few calls from these users easily shifted the average way over our 1s threshold.&lt;/p&gt;

&lt;p&gt;Why? The average (as a value) can be dramatically affected by outliers like this, even though they impact only a small fraction of our users. The average uses the sum of the data, so the magnitude of the outliers can have an outsized impact, whereas the median and other percentiles are based on the ordering of the data.&lt;/p&gt;
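&lt;p&gt;A toy Python example (with made-up numbers) makes this concrete – a single extreme outlier drags the average well past a 1-second threshold, while the median and 90th percentile don’t move at all:&lt;/p&gt;

```python
from statistics import mean, median

def percentile(values, fraction):
    # Discrete percentile: sort and index (illustrative sketch).
    ordered = sorted(values)
    return ordered[min(int(fraction * len(ordered)), len(ordered) - 1)]

# 99 "normal" API calls around 250ms, plus one 100-second outlier.
normal_calls = [250.0] * 99
with_outlier = normal_calls + [100_000.0]

print(mean(normal_calls), mean(with_outlier))      # 250.0 vs 1247.5
print(median(normal_calls), median(with_outlier))  # 250.0 vs 250.0
print(percentile(with_outlier, 0.9))               # still 250.0
```

&lt;p&gt;One outlier out of 100 calls is enough to quintuple the average, because the average sums the magnitudes; the median and 90th percentile only care where values fall in the ordering.&lt;/p&gt;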


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fniec8uhl3sea37z7vkmh.jpg" alt="Our curve with a few outliers, where less than 1% of the API call responses are over 100s (the response time has a break representing the fact that the outliers would be way to the right otherwise, still, the graph is not to scale)." width="800" height="366"&gt;&lt;b&gt;Our curve with a few outliers, where less than 1% of the API call responses are over 100s (the response time has a break representing the fact that the outliers would be way to the right otherwise, still, the graph is not to scale).&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
&lt;strong&gt;The point is that the average doesn’t give us a good way to distinguish between outliers and real effects and can give odd results when we have a long-tail or asymmetric distribution.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why is this important to understand?&lt;/p&gt;

&lt;p&gt;Well, in the first case, we had a problem affecting 10% of our API calls, which could be 10% or more of our users (how could it affect more than 10% of the users? Well, if a user makes 10 calls on average, and 10% of API calls are affected, then, on average, all the users would be affected... or at least some large percentage of them).&lt;/p&gt;

&lt;p&gt;We want to respond very quickly to that type of urgent problem, affecting a large number of users. We built alerts and might even get our engineers up in the middle of the night and/or revert a change.&lt;/p&gt;

&lt;p&gt;But the second case, where “off-label” user behavior or minor bugs had a large effect on a few API calls, was much more benign. Because relatively few users are affected by these outliers, we wouldn’t want to get our engineers up in the middle of the night or revert a change. (Outliers can still be important to identify and understand, both for understanding user needs or potential bugs in the code, but they usually &lt;em&gt;aren’t an emergency&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Instead of the average, we can use multiple percentiles to understand this type of behavior. Remember, unlike averages, percentiles rely on the ordering of the data rather than the magnitude of the data. If we use the 90th percentile, we know that 10% of users have values (API response times in our case) greater than it.&lt;/p&gt;

&lt;p&gt;Let’s look at the 90th percentile in our original graph; it nicely captures some of the long tail behavior:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5spc0p7q8iud3x8fqjk.jpg" alt="Our original API response time graph showing the 90th percentile, median, and average. Graph not to scale." width="800" height="492"&gt;&lt;b&gt;Our original API response time graph showing the 90th percentile, median, and average. Graph not to scale.&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
When we have some outliers caused by a few users who’re running super long queries or a bug affecting a small group of queries, the average shifts, but the 90th percentile is hardly affected.&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flp6l804pd3yjfjs2g86s.jpg" alt="Outliers affect the average but don’t impact the 90th percentile or median. (Graph is not to scale.)" width="800" height="366"&gt;&lt;b&gt;Outliers affect the average but don’t impact the 90th percentile or median. (Graph is not to scale.)&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
But, when the tail is increased due to a problem affecting 10% of users, we see that the 90th percentile shifts outward pretty dramatically – which enables our team to be notified and respond appropriately:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpxqsm421j1vyu3hjeiq.jpg" alt="But when there are “real” effects from responses that impact more than 10% of users, the 90th percentile shifts dramatically (Graph not to scale.)" width="800" height="492"&gt;&lt;b&gt;But when there are “real” effects from responses that impact more than 10% of users, the 90th percentile shifts dramatically (Graph not to scale.)&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
This (hopefully) gives you a better sense of how and why percentiles can help you identify cases where large numbers of users are affected – but not burden you with false positives that might wake engineers up and give them alarm fatigue!&lt;/p&gt;

&lt;p&gt;So, now that we know why we might want to use percentiles rather than averages, let’s talk about how we calculate them.&lt;/p&gt;


&lt;h2&gt;How percentiles work in PostgreSQL&lt;/h2&gt;

&lt;p&gt;To calculate any sort of exact percentile, you take all your values, sort them, then find the nth value based on the percentile you’re trying to calculate.&lt;/p&gt;

&lt;p&gt;To see how this works in PostgreSQL, we’ll present a simplified case of our ad analytics company’s API tracking.&lt;/p&gt;

&lt;p&gt;We’ll start off with a table like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;response_time&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In PostgreSQL we can calculate a percentile over the column &lt;code&gt;response_time&lt;/code&gt; using the &lt;a href="https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE" rel="noopener noreferrer"&gt;&lt;code&gt;percentile_disc&lt;/code&gt; aggregate&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;percentile_disc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;median&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This doesn’t look the same as a normal aggregate; the &lt;code&gt;WITHIN GROUP (ORDER BY …)&lt;/code&gt; is a different syntax that works on special aggregates called &lt;a href="https://www.postgresql.org/docs/13/xaggr.html#XAGGR-ORDERED-SET-AGGREGATES" rel="noopener noreferrer"&gt;ordered-set aggregates&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here we pass the percentile we want (0.5, or the 50th percentile, for the median) to the &lt;code&gt;percentile_disc&lt;/code&gt; function, and the column that we’re evaluating (&lt;code&gt;response_time&lt;/code&gt;) goes in the &lt;code&gt;ORDER BY&lt;/code&gt; clause.&lt;/p&gt;

&lt;p&gt;The reason for this syntax becomes clearer once we understand what’s going on under the hood. A percentile guarantees that x percent of the data falls below the value it returns. To calculate the median exactly, we need to sort all of our data into a list and then pick out the value where 50% of the data falls below it and 50% falls above it.&lt;/p&gt;

&lt;p&gt;In the section of our previous post on &lt;a href="https://blog.timescale.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/#a-primer-on-postgresql-aggregation-through-pictures" rel="noopener noreferrer"&gt;how PostgreSQL aggregates work&lt;/a&gt;, we discussed how an aggregate like &lt;code&gt;avg&lt;/code&gt; is computed.&lt;/p&gt;

&lt;p&gt;As it scans each row, the transition function updates some internal state (for &lt;code&gt;avg&lt;/code&gt;, the &lt;code&gt;sum&lt;/code&gt; and the &lt;code&gt;count&lt;/code&gt;), and then a final function processes the internal state to produce a result (for &lt;code&gt;avg&lt;/code&gt;, dividing &lt;code&gt;sum&lt;/code&gt; by &lt;code&gt;count&lt;/code&gt;).&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgfm7jy63woav0mi1z2r.gif" alt="A GIF showing how the avg is calculated in PostgreSQL with the sum and count as the partial state as rows are processed and a final function that divides them when we’ve finished." width="600" height="602"&gt;&lt;b&gt;A GIF showing how the avg is calculated in PostgreSQL with the sum and count as the partial state as rows are processed and a final function that divides them when we’ve finished.&lt;/b&gt;
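&lt;p&gt;To make the two-step structure concrete, here’s a minimal Python sketch of the same idea (the real implementation is C inside PostgreSQL; these function names are purely illustrative):&lt;/p&gt;

```python
# Transition function: fold each row into a small, fixed-size state.
def avg_transition(state, value):
    total, count = state
    return (total + value, count + 1)

# Final function: turn the accumulated state into the result.
def avg_final(state):
    total, count = state
    return total / count

state = (0.0, 0)
for value in [10, 20, 30]:
    state = avg_transition(state, value)

result = avg_final(state)  # 20.0
```

&lt;p&gt;Note that the state stays two values wide no matter how many rows are scanned; that property is exactly what exact percentiles lack.&lt;/p&gt;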




&lt;p&gt;&lt;br&gt;
Ordered-set aggregates like &lt;code&gt;percentile_disc&lt;/code&gt; work somewhat similarly, with one exception: instead of the state being a small, fixed-size data structure (like &lt;code&gt;sum&lt;/code&gt; and &lt;code&gt;count&lt;/code&gt; for &lt;code&gt;avg&lt;/code&gt;), the aggregate must keep every value it has processed so it can sort them and calculate the percentile later.&lt;/p&gt;

&lt;p&gt;Usually, PostgreSQL does this by putting the values into a data structure called a &lt;code&gt;tuplestore&lt;/code&gt;, which is designed to store and sort large sets of values efficiently.&lt;/p&gt;

&lt;p&gt;Then, when the final function is called, the &lt;code&gt;tuplestore&lt;/code&gt; first sorts the data. Based on the percentile we passed to &lt;code&gt;percentile_disc&lt;/code&gt;, it then traverses to the correct point in the sorted data (halfway through for the median) and outputs the result.&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxfp2tct5yc5mknqe77j.gif" alt="With the  raw `percentile_disc` endraw  ordered set aggregate, PostgreSQL has to store each value it sees in a  raw `tuplestore` endraw  then when it’s processed all the rows, it sorts them, and then goes to the right point in the sorted list to extract the percentile we need." width="600" height="602"&gt;&lt;b&gt;With the `percentile_disc` ordered set aggregate, PostgreSQL has to store each value it sees in a `tuplestore` then when it’s processed all the rows, it sorts them, and then goes to the right point in the sorted list to extract the percentile we need.&lt;/b&gt;
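&lt;p&gt;A minimal Python sketch of what &lt;code&gt;percentile_disc&lt;/code&gt; effectively computes (mirroring the documented semantics, not PostgreSQL’s actual code): keep every value, sort, then index into the sorted list:&lt;/p&gt;

```python
import math

def percentile_disc(values, fraction):
    # Sort the full list (PostgreSQL's tuplestore does this when the
    # final function runs), then return the first value whose cumulative
    # fraction of the sorted data reaches `fraction`.
    ordered = sorted(values)
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]

median = percentile_disc([3, 1, 4, 1, 5, 9, 2, 6, 5], 0.5)  # 4
```

&lt;p&gt;The state here is the entire value list, which is the root of the costs discussed in the next section.&lt;/p&gt;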




&lt;p&gt;&lt;br&gt;
Instead of performing these expensive calculations over very large data sets, &lt;strong&gt;many people find that approximate percentile calculations can provide a “close enough” approximation with significantly less work&lt;/strong&gt;...which is why we introduced percentile approximation hyperfunctions.&lt;/p&gt;


&lt;h2&gt;
  
  
  Percentile approximation: what it is and why we use it in TimescaleDB hyperfunctions&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;In my experience, people use averages and other summary statistics more frequently than percentiles because they are significantly “cheaper” to calculate over large datasets, both in computational resources and time.&lt;/p&gt;

&lt;p&gt;As we noted above, calculating the average in PostgreSQL has a simple, two-valued aggregate state. Even if we calculate a few additional, related functions like the standard deviation, we still just need a small, fixed number of values to calculate the function.&lt;/p&gt;

&lt;p&gt;In contrast, to calculate the percentile, we need all of the input values in a sorted list.&lt;/p&gt;

&lt;p&gt;This leads to a few issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory footprint:&lt;/strong&gt; The algorithm has to keep all of these values somewhere, holding them in memory until memory pressure forces it to write some of them to disk (known as “spilling to disk”). This creates a significant memory burden and/or slows the operation down dramatically, because disk access is orders of magnitude slower than memory access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited Benefits from Parallelization:&lt;/strong&gt; Even though the algorithm can sort lists in parallel, the benefits from parallelization are limited because it still needs to merge all the sorted lists into a single, sorted list in order to calculate a percentile.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High network costs:&lt;/strong&gt; In distributed systems (like TimescaleDB multi-node), all the values must be passed over the network to one node to be made into a single sorted list, which is slow and costly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No true partial states:&lt;/strong&gt; Materialization of partial states (e.g., for continuous aggregates) is not useful because the partial state is simply all the values that underlie it. This could save on sorting the lists, but the storage burden would be high and the payoff low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No streaming algorithm:&lt;/strong&gt; For streaming data, this is completely infeasible. You still need to maintain the full list of values (similar to the materialization of partial states problem above), which means that the algorithm essentially needs to store the entire stream!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these issues are manageable when you’re dealing with relatively small data sets, but for high-volume time-series workloads, they quickly become serious.&lt;/p&gt;

&lt;p&gt;But you only need the full list of values if you want exact percentiles. &lt;strong&gt;With relatively large datasets, you can often accept some accuracy tradeoffs to avoid running into any of these issues.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The problems above, and the recognition of the tradeoffs involved in weighing whether to use averages or percentiles, led to the development of multiple algorithms to &lt;a href="https://en.wikipedia.org/wiki/Quantile#Approximate_quantiles_from_a_stream" rel="noopener noreferrer"&gt;approximate percentiles in high volume systems&lt;/a&gt;. Most percentile approximation approaches involve some sort of modified &lt;a href="https://en.wikipedia.org/wiki/Histogram" rel="noopener noreferrer"&gt;histogram&lt;/a&gt; to represent the overall shape of the data more compactly, while still capturing much of the shape of the distribution.&lt;/p&gt;

&lt;p&gt;As we were designing hyperfunctions, we thought about how we could capture the benefits of percentiles (e.g., robustness to outliers, better correspondence with real-world impacts) while avoiding some of the pitfalls that come with calculating exact percentiles (above).&lt;/p&gt;

&lt;p&gt;Percentile approximations seemed like the right fit for working with large, time-series datasets.&lt;/p&gt;

&lt;p&gt;The result is a whole family of &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/" rel="noopener noreferrer"&gt;percentile approximation hyperfunctions&lt;/a&gt;, built into TimescaleDB. The simplest way to call them is to use the &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile_agg/" rel="noopener noreferrer"&gt;&lt;code&gt;percentile_agg&lt;/code&gt; aggregate&lt;/a&gt; along with the &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/" rel="noopener noreferrer"&gt;&lt;code&gt;approx_percentile&lt;/code&gt; accessor&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This query calculates approximate 10th, 50th, and 90th percentiles:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p90&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;(If you’d like to learn more about aggregates, accessors, and two-step aggregation design patterns, check out &lt;a href="https://blog.timescale.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/" rel="noopener noreferrer"&gt;our primer on PostgreSQL two-step aggregation&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;These percentile approximations have many benefits when compared to the normal PostgreSQL exact percentiles, especially when used for large data sets.&lt;/p&gt;
&lt;h3&gt;
  
  
  Memory footprint
&lt;/h3&gt;

&lt;p&gt;When calculating percentiles over large data sets, our percentile approximations limit the memory footprint (and the need to spill to disk, as described above).&lt;/p&gt;

&lt;p&gt;Standard percentiles create memory pressure since they build up as much of the data set in memory as possible...and then slow down when forced to spill to disk.&lt;/p&gt;

&lt;p&gt;Conversely, hyperfunctions’ percentile approximations have fixed size representations based on the number of buckets in their modified histograms, so they limit the amount of memory required to calculate them.&lt;/p&gt;
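&lt;p&gt;A toy Python sketch of the fixed-memory idea: no matter how many values we feed in, the state never grows beyond the bucket count. (The real hyperfunctions use smarter, variable-width buckets; this class is purely illustrative.)&lt;/p&gt;

```python
from collections import Counter

class FixedBucketSketch:
    """Toy fixed-memory sketch: however many values come in, the state
    never grows beyond `num_buckets` counters."""

    def __init__(self, lo, hi, num_buckets=200):
        self.lo = lo
        self.width = (hi - lo) / num_buckets
        self.num_buckets = num_buckets
        self.counts = Counter()

    def add(self, value):
        # Clamp so values at (or past) the upper bound land in the last bucket.
        bucket = min(int((value - self.lo) / self.width), self.num_buckets - 1)
        self.counts[bucket] += 1

sketch = FixedBucketSketch(lo=0.0, hi=10.0)
for i in range(100_000):
    sketch.add((i % 1000) / 100.0)  # 100,000 values, state stays tiny
```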
&lt;h3&gt;
  
  
  Parallelization in single and multi-node TimescaleDB
&lt;/h3&gt;

&lt;p&gt;All of our percentile approximation algorithms are parallelizable, so they can be computed using multiple workers on a single node. This can provide significant speedups over ordered-set aggregates like &lt;code&gt;percentile_disc&lt;/code&gt;, which are not parallelizable in PostgreSQL.&lt;/p&gt;

&lt;p&gt;Parallelizability provides a speedup in single node setups of TimescaleDB – and this can be even more pronounced in &lt;a href="https://blog.timescale.com/blog/timescaledb-2-0-a-multi-node-petabyte-scale-completely-free-relational-database-for-time-series/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=percentile-approximation-2021&amp;amp;utm_content=blog-2-0-multi-node-setups#timescaledb-20-multi-node-petabyte-scale-and-completely-free" rel="noopener noreferrer"&gt;multi-node TimescaleDB setups&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Why? To calculate a percentile in multi-node TimescaleDB using the &lt;code&gt;percentile_disc&lt;/code&gt; ordered-set aggregate (the standard way you would do this without our approximation hyperfunctions), you must send each value back from the data node to the access node, sort the data, and then provide an output.&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcidn9azwzcey5si6lose.jpg" alt="When calculating the exact percentile in TimescaleDB multi-node, each data node must send all of the data back to the access node. The access node then sorts and calculates the percentile" width="800" height="520"&gt;&lt;b&gt;When calculating the exact percentile in TimescaleDB multi-node, each data node must send all of the data back to the access node. The access node then sorts and calculates the percentile&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
The “standard” way is very, very costly because all of the data needs to get sent to the access node over the network from each data node, which is slow and expensive.&lt;/p&gt;

&lt;p&gt;Even after the access node gets the data, it still needs to sort and calculate the percentile over all that data before returning a result to the user. (Caveat: there is the possibility that each data node could sort separately, and the access node would just perform a merge sort. But, this wouldn’t negate the need for sending all the data over the network, which is the most costly step.)&lt;/p&gt;

&lt;p&gt;With approximate percentile hyperfunctions, much more of the work can be &lt;a href="https://blog.timescale.com/blog/achieving-optimal-query-performance-with-a-distributed-time-series-database-on-postgresql/#pushing-down-work-to-data-nodes" rel="noopener noreferrer"&gt;pushed down to the data node&lt;/a&gt;. Partial approximate percentiles can be computed on each data node, with only a fixed-size data structure returned over the network.&lt;/p&gt;

&lt;p&gt;Once each data node calculates its partial data structure, the access node combines these structures, calculates the approximate percentile, and returns the result to the user.&lt;/p&gt;

&lt;p&gt;This means that more work can be done on the data nodes and, most importantly, far, far less data has to be passed over the network. With large datasets, this can result in orders of magnitude less time spent on these calculations.&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcz2wh9mkgsakfat58pfv.jpg" alt="Using our percentile approximation hyperfunctions, the data nodes no longer have to send all of the data back to the access node. Instead, they calculate a partial approximation and send them back to the access node, which then combines the partials and produces a result. This saves a lot of time on network calls since it parallelizes the computation over the data nodes, rather than performing much of the work on the access node." width="800" height="520"&gt;&lt;b&gt;Using our percentile approximation hyperfunctions, the data nodes no longer have to send all of the data back to the access node. Instead, they calculate a partial approximation and send them back to the access node, which then combines the partials and produces a result. This saves a lot of time on network calls since it parallelizes the computation over the data nodes, rather than performing much of the work on the access node.&lt;/b&gt;
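&lt;p&gt;In Python terms, combining partials is just bucket-wise addition of counts, which is what makes the aggregate parallelizable and cheap to ship over the network (the partial format here is a hypothetical simplification):&lt;/p&gt;

```python
from collections import Counter

# Each data node returns a fixed-size partial state: a map from bucket
# index to count. The access node merges partials by adding counts bucket-wise.
def combine(partial_a, partial_b):
    merged = Counter(partial_a)
    merged.update(partial_b)  # Counter.update adds counts, not replaces
    return merged

node_1 = Counter({0: 120, 1: 80, 2: 15})  # hypothetical data-node partials
node_2 = Counter({1: 60, 2: 40, 3: 5})
merged = combine(node_1, node_2)  # {0: 120, 1: 140, 2: 55, 3: 5}
```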





&lt;h3&gt;
  
  
  Materialization in continuous aggregates
&lt;/h3&gt;

&lt;p&gt;TimescaleDB includes a feature called &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/" rel="noopener noreferrer"&gt;continuous aggregates&lt;/a&gt;, designed to make queries on very large datasets run faster.&lt;/p&gt;

&lt;p&gt;TimescaleDB continuous aggregates continuously and incrementally store the results of an aggregation query in the background, so when you run the query, only the data that has changed needs to be computed, not the entire dataset.&lt;/p&gt;

&lt;p&gt;Unfortunately, exact percentiles using &lt;code&gt;percentile_disc&lt;/code&gt; cannot be stored in continuous aggregates because they cannot be broken down into a partial form, and would instead require storing the entire dataset inside the aggregate.&lt;/p&gt;

&lt;p&gt;We designed our percentile approximation algorithms to be usable with continuous aggregates. They have fixed-size partial representations that can be stored and re-aggregated inside of continuous aggregates.&lt;/p&gt;

&lt;p&gt;This is a huge advantage compared to exact percentiles because now you can do things like baselining and alerting on longer periods, without having to re-calculate from scratch every time.&lt;/p&gt;

&lt;p&gt;Let’s go back to our API response time example and imagine we want to identify recent outliers to investigate potential problems.&lt;/p&gt;

&lt;p&gt;One way to do that would be to look at everything that is, say, above the 99th percentile in the previous hour.&lt;/p&gt;

&lt;p&gt;As a reminder, we have a table:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;response_time&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;create_hypertable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'responses'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ts'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- make it a hypertable so we can make continuous aggs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;First, we’ll create a one hour aggregation:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;responses_1h_agg&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timescaledb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1 hour'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1 hour'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Note that we don’t perform the accessor function in the continuous aggregate; we just perform the aggregation function.&lt;/p&gt;

&lt;p&gt;Now, we can find data from the last 30 seconds that falls above the previous hour’s 99th percentile like so:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="s1"&gt;'30s'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;response_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses_1h_agg&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1 hour'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="s1"&gt;'1 hour'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;At the ad analytics company, we had a lot of users, so we’d have tens or hundreds of thousands of API calls every hour.&lt;/p&gt;

&lt;p&gt;By default, we have 200 buckets in our representation, so we’re getting a large reduction in the amount of data that we store and process by using a continuous aggregate. This speeds up response times significantly. If you don’t have as much data, you’ll want to increase the size of your time buckets or decrease the fidelity of the approximation to achieve a similar reduction in the data you have to process.&lt;/p&gt;

&lt;p&gt;We mentioned that we only performed the aggregate step in the continuous aggregate view definition; we didn’t use our &lt;code&gt;approx_percentile&lt;/code&gt; accessor function directly in the view. We do that because we want to be able to use other accessor functions and/or the &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/rollup-percentile/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=percentile-approximation-2021&amp;amp;utm_content=docs-rollup-percentile" rel="noopener noreferrer"&gt;&lt;code&gt;rollup&lt;/code&gt;&lt;/a&gt; function, which you may remember as one of the main &lt;a href="https://blog.timescale.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=percentile-approximation-2021&amp;amp;utm_content=blog-how-percentile-aggregation-works#why-we-use-the-two-step-aggregate-design-pattern" rel="noopener noreferrer"&gt;reasons we chose the two-step aggregate approach&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s look at how that works. We can create a daily rollup and get the 99th percentile like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;rollup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p_99_daily&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses_1h_agg&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We could even use the &lt;code&gt;approx_percentile_rank&lt;/code&gt; accessor function, which tells you what percentile a value would fall into.&lt;/p&gt;

&lt;p&gt;Percentile rank is the inverse of the percentile function. Normally, you ask: what is the value at the nth percentile? The answer is a value.&lt;/p&gt;

&lt;p&gt;With percentile rank, you ask: what percentile would this value fall into? The answer is a percentile.&lt;/p&gt;
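&lt;p&gt;Conceptually, percentile rank over raw data reduces to counting; &lt;code&gt;approx_percentile_rank&lt;/code&gt; estimates the same quantity from the sketch instead of scanning every value. A hypothetical exact version in Python:&lt;/p&gt;

```python
def percentile_rank(value, values):
    # The fraction of the data set that falls at or below `value`:
    # the inverse of asking "what value sits at this percentile?"
    at_or_below = sum(1 for v in values if v <= value)
    return at_or_below / len(values)

rank = percentile_rank(3, [1, 2, 3, 4])  # 0.75
```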

&lt;p&gt;So, using &lt;code&gt;approx_percentile_rank&lt;/code&gt; allows us to see where the values that arrived in the last 5 minutes rank compared to values in the last day:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;last_day&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;rollup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pct_daily&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;foo_1h_agg&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;approx_percentile_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pct_daily&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pct_rank_in_day&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_day&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="s1"&gt;'5 minutes'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is another way continuous aggregates can be valuable.&lt;/p&gt;

&lt;p&gt;We performed a &lt;code&gt;rollup&lt;/code&gt; over a day, which just combined 24 partial states, rather than performing a full calculation over 24 hours of data with millions of data points.&lt;/p&gt;

&lt;p&gt;We then used the &lt;code&gt;rollup&lt;/code&gt; to see how that impacted just the last few minutes of data, giving us insight into how the last few minutes compare to the last 24 hours. These are just a few examples of how the percentile approximation hyperfunctions can give us some pretty nifty results and allow us to perform complex analysis relatively simply.&lt;/p&gt;


&lt;h2&gt;
  
  
  Percentile approximation deep dive: approximation methods, how they work, and how to choose&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Some of you may be wondering how TimescaleDB hyperfunctions’ underlying algorithms work, so let’s dive in! (For those of you who don’t want to get into the weeds, feel free to skip over this bit.)&lt;/p&gt;
&lt;h3&gt;
  
  
  Approximation methods and how they work
&lt;/h3&gt;

&lt;p&gt;We implemented two different percentile approximation algorithms as TimescaleDB hyperfunctions: &lt;a href="https://arxiv.org/pdf/2004.08604.pdf" rel="noopener noreferrer"&gt;UDDSketch&lt;/a&gt; and &lt;a href="https://github.com/tdunning/t-digest" rel="noopener noreferrer"&gt;T-Digest&lt;/a&gt;. Each is useful in different scenarios, but first, let’s understand some of the basics of how they work.&lt;/p&gt;

&lt;p&gt;Both use a modified histogram to approximate the shape of a distribution. A histogram buckets nearby values into a group and tracks their frequency.&lt;/p&gt;

&lt;p&gt;You often see a histogram plotted like so:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm1nuel9a1q9nx4ztpzy.png" alt="A histogram representing the same data as our response time frequency curve above, you can see how the shape of the graph is similar to the frequency curve. Not to scale." width="800" height="328"&gt;&lt;b&gt;A histogram representing the same data as our response time frequency curve above, you can see how the shape of the graph is similar to the frequency curve. Not to scale.&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
If you compare this to the frequency curve we showed above, you can see how this could provide a reasonable approximation of the API response time vs. frequency curve. Essentially, a histogram is a series of bucket boundaries plus a count of the number of values that fall within each bucket.&lt;/p&gt;

&lt;p&gt;To calculate the approximate percentile for, say, the 20th percentile, you first consider the fraction of your total data that would represent it. For our 20th percentile, that would be 0.2 * &lt;code&gt;total_points&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once you have that value, you can then sum the frequencies in each bucket, left to right, to find at which bucket you get the value closest to 0.2 * &lt;code&gt;total_points&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can even interpolate between buckets to get more exact approximations when the bucket spans a percentile of interest.&lt;/p&gt;
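The bucket walk described above can be sketched in a few lines of Python. This is a toy fixed-width histogram for illustration only (the hyperfunctions use more sophisticated variable-width sketches, as discussed next):

```python
# A toy fixed-width histogram: `boundaries` has one more entry than `counts`,
# and counts[i] is the number of values in [boundaries[i], boundaries[i+1]).

def approx_percentile(percentile, boundaries, counts):
    """Walk the buckets left to right, summing frequencies until we reach
    percentile * total_points, then interpolate within that bucket."""
    total = sum(counts)
    target = percentile * total            # e.g., 0.2 * total_points
    cumulative = 0
    for i, count in enumerate(counts):
        if cumulative + count >= target:
            # Linear interpolation inside the bucket that spans the target.
            fraction = (target - cumulative) / count
            left, right = boundaries[i], boundaries[i + 1]
            return left + fraction * (right - left)
        cumulative += count
    return boundaries[-1]

# Buckets [0,10), [10,20), [20,30) holding 2, 6, and 2 values:
print(approx_percentile(0.5, [0, 10, 20, 30], [2, 6, 2]))  # 15.0
```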

&lt;p&gt;When you think of a histogram, you may think of one that looks like the one above, where the buckets are all the same width.&lt;/p&gt;

&lt;p&gt;But choosing the bucket width, especially for widely varying data, can get very difficult or lead you to store a lot of extra data.&lt;/p&gt;

&lt;p&gt;In our API response time example, we could have data spanning from tens of milliseconds up to ten seconds or hundreds of seconds.&lt;/p&gt;

&lt;p&gt;This means that the right bucket size for a good approximation of the 1st percentile, e.g., 2ms, would be WAY smaller than necessary for a good approximation of the 99th percentile.&lt;/p&gt;

&lt;p&gt;This is why most percentile approximation algorithms use a modified histogram with a &lt;em&gt;variable bucket width&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For instance, the UDDSketch algorithm uses logarithmically sized buckets, which might look something like this:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1shqfsjf3faby1zo1bl.png" alt="A modified histogram showing how logarithmic buckets like those the UDDSketch algorithm uses can still represent the data. (Note: we’d need to modify the plot to plot the frequency/bucket width so that the scale would remain similar; however, this is just for demonstration purposes and not drawn to scale)." width="800" height="331"&gt;&lt;b&gt;A modified histogram showing how logarithmic buckets like those the UDDSketch algorithm uses can still represent the data. (Note: we’d need to modify the plot to plot the frequency/bucket width so that the scale would remain similar; however, this is just for demonstration purposes and not drawn to scale).&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
The designers of UDDSketch chose logarithmically sized buckets like this because they care about relative error rather than absolute error.&lt;/p&gt;

&lt;p&gt;For reference, absolute error is defined as the difference between the actual and the approximated value:&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;errabsolute=abs(vactual−vapprox)
err_\text{absolute} = abs(v_\text{actual} - v_\text{approx})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;er&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;absolute&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ab&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;actual&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span 
class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;approx&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;p&gt;To get relative error, you divide the absolute error by the actual value:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;errrelative=errabsolutevactual
err_\text{relative} = \frac{err_\text{absolute}}{v_\text{actual}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;er&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;relative&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;actual&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span 
class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;er&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;absolute&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
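To make the two definitions concrete, here is a quick sketch; the values are the hypothetical response times used in the example that follows:

```python
# Absolute error is the raw difference; relative error scales it by the
# actual value, so the same 100ms miss "costs" more at 10ms than at 10s.

def abs_error(v_actual, v_approx):
    return abs(v_actual - v_approx)

def rel_error(v_actual, v_approx):
    return abs_error(v_actual, v_approx) / v_actual

# A 100ms absolute error at 10s is a 1% relative error...
print(rel_error(10.0, 10.1))
# ...but the same 100ms absolute error at 10ms is a 1000% relative error.
print(rel_error(0.010, 0.110))
```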


&lt;p&gt;If we had a constant absolute error, we might run into a situation like the following:&lt;/p&gt;

&lt;p&gt;We ask for the 99th percentile, and the algorithm tells us it’s 10s +/- 100ms. Then, we ask for the 1st percentile, and the algorithm tells us it’s 10ms +/- 100ms.&lt;/p&gt;

&lt;p&gt;The error for the 1st percentile is way too high!&lt;/p&gt;

&lt;p&gt;If we instead have a constant relative error of 1%, then we’d get 10ms +/- 100 microseconds.&lt;/p&gt;

&lt;p&gt;This is much, much more useful. (Going the other way, a constant absolute error of 100 microseconds would give 10s +/- 100 microseconds, which is tighter than we need; we likely don’t care about 100 microseconds if we’re already at 10s.)&lt;/p&gt;

&lt;p&gt;This is why the UDDSketch algorithm uses logarithmically sized buckets, where the width of the bucket scales with the size of the underlying data. This allows the algorithm to provide constant relative error across the full range of percentiles.&lt;/p&gt;
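Here is a minimal sketch of logarithmic bucketing in the spirit of UDDSketch (the bucket formula follows the DDSketch family of algorithms; this is an illustration, not the production implementation):

```python
import math

# With target relative error `alpha`, set gamma = (1 + alpha) / (1 - alpha).
# Bucket i covers (gamma^(i-1), gamma^i], so bucket widths grow with the
# values they hold, and reporting a fixed representative point inside the
# bucket keeps relative error <= alpha for any positive value, however
# large or small.

def bucket_index(value, alpha=0.01):
    gamma = (1 + alpha) / (1 - alpha)
    return math.ceil(math.log(value, gamma))

def bucket_estimate(index, alpha=0.01):
    gamma = (1 + alpha) / (1 - alpha)
    return 2 * gamma ** index / (gamma + 1)   # representative value for the bucket

alpha = 0.01
for v in (0.002, 1.0, 10.0, 100.0):           # milliseconds up to hundreds of seconds
    est = bucket_estimate(bucket_index(v, alpha), alpha)
    assert abs(est - v) / v <= alpha + 1e-9   # same relative bound at every scale
```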

&lt;p&gt;As a result, you always know that the true value of the percentile will fall within some range &lt;br&gt;

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;([vapprox(1−err),vapprox(1+err)])
([v_\text{approx} (1-err), v_\text{approx} (1+err)])
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;([&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;approx&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;err&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;approx&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span 
class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;err&lt;/span&gt;&lt;span class="mclose"&gt;)])&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;On the other hand, T-Digest uses buckets that are variably sized, based on where they fall in the distribution. Specifically, it uses smaller buckets at the extremes of the distribution and larger buckets in the middle.&lt;/p&gt;

&lt;p&gt;So, it might look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q3afzavb5gp85lx660d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q3afzavb5gp85lx660d.png" alt="A modified histogram showing how variably sized buckets that are smaller at the extremes, like what the TDigest algorithm uses, can still represent the data (Note: for illustration purposes, not to scale.)" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;b&gt;A modified histogram showing how variably sized buckets that are smaller at the extremes, like what the TDigest algorithm uses, can still represent the data (Note: for illustration purposes, not to scale.)&lt;/b&gt;




&lt;p&gt;&lt;br&gt;
This histogram structure with variable-sized buckets optimizes for different things than UDDSketch. Specifically, it takes advantage of the idea that when you’re trying to understand the distribution, you likely care more about fine distinctions between extreme values than about the middle of the range.&lt;/p&gt;

&lt;p&gt;For example, I usually care a lot about distinguishing the 5th percentile from the 1st or the 95th from the 99th, while I don’t care as much about distinguishing between the 50th and the 55th percentile.&lt;/p&gt;

&lt;p&gt;The distinctions in the middle are less meaningful and interesting than the distinctions at the extremes. (Caveat: the T-Digest algorithm is a bit more complex than this, and our description doesn’t completely capture its behavior, but we’re trying to give a general gist of what’s going on. If you want more information, &lt;a href="https://arxiv.org/abs/1902.04023" rel="noopener noreferrer"&gt;we recommend this paper&lt;/a&gt;).&lt;/p&gt;
&lt;h3&gt;
  
  
  Using advanced approximation methods in TimescaleDB hyperfunctions
&lt;/h3&gt;

&lt;p&gt;So far in this post, we’ve only used the general-purpose &lt;code&gt;percentile_agg&lt;/code&gt; aggregate. It uses the UDDSketch algorithm under the hood and is a good starting point for most users.&lt;/p&gt;

&lt;p&gt;We’ve also provided separate &lt;code&gt;uddsketch&lt;/code&gt; and &lt;code&gt;tdigest&lt;/code&gt; aggregates to allow for more customizability.&lt;/p&gt;

&lt;p&gt;Each takes the number of buckets as its first argument (which determines the size of the internal data structure), and &lt;code&gt;uddsketch&lt;/code&gt; also takes an argument for the target maximum relative error.&lt;/p&gt;

&lt;p&gt;We can use the &lt;code&gt;approx_percentile&lt;/code&gt; accessor function just as we did with &lt;code&gt;percentile_agg&lt;/code&gt;, so we can compare median estimates like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uddsketch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;median_udd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tdigest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;median_tdig&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both of them also work with the &lt;code&gt;approx_percentile_rank&lt;/code&gt; hyperfunction we discussed above.&lt;/p&gt;

&lt;p&gt;If we wanted to see where 1000 would fall in our distribution, we could do something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;approx_percentile_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uddsketch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rnk_udd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;approx_percentile_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tdigest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rnk_tdig&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition, each approximation has some accessors that work only with it, based on its internal structure.&lt;/p&gt;

&lt;p&gt;For instance, &lt;code&gt;uddsketch&lt;/code&gt; provides an &lt;code&gt;error&lt;/code&gt; accessor function, which tells you the actual guaranteed maximum relative error based on the values the &lt;code&gt;uddsketch&lt;/code&gt; has seen.&lt;/p&gt;

&lt;p&gt;The UDDSketch algorithm guarantees a maximum relative error, while the T-Digest algorithm does not, so &lt;code&gt;error&lt;/code&gt; only works with &lt;code&gt;uddsketch&lt;/code&gt; (and &lt;code&gt;percentile_agg&lt;/code&gt; because it uses the &lt;code&gt;uddsketch&lt;/code&gt; algorithm under the hood).&lt;/p&gt;

&lt;p&gt;This error guarantee is one of the main reasons we chose UDDSketch as the default: error guarantees are useful for determining whether you’re getting a good approximation.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Tdigest&lt;/code&gt;, on the other hand, provides &lt;code&gt;min_val&lt;/code&gt; &amp;amp; &lt;code&gt;max_val&lt;/code&gt; accessor functions because it biases its buckets to the extremes and can provide the exact min and max values at no extra cost. &lt;code&gt;Uddsketch&lt;/code&gt; can’t provide that.&lt;/p&gt;

&lt;p&gt;You can call these other accessors like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uddsketch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;median_udd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uddsketch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;error_udd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tdigest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;median_tdig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;min_val&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tdigest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_val&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tdigest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we discussed in the last post about &lt;a href="https://blog.timescale.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=percentile-approximation-2021&amp;amp;utm_content=blog-how-postgresql-aggregation-works-hyperfunctions#why-we-use-the-two-step-aggregate-design-pattern" rel="noopener noreferrer"&gt;two-step aggregates&lt;/a&gt;, calls to all of these aggregates are automatically deduplicated and optimized by PostgreSQL so that you can call multiple accessors with minimal extra cost.&lt;/p&gt;

&lt;p&gt;They also both have &lt;code&gt;rollup&lt;/code&gt; functions defined for them, so you can re-aggregate when they’re used in continuous aggregates or regular queries.&lt;/p&gt;

&lt;p&gt;(Note: &lt;code&gt;tdigest&lt;/code&gt; rollup can introduce some additional error or differences compared to calling the &lt;code&gt;tdigest&lt;/code&gt; on the underlying data directly. In most cases, this should be negligible and would often be comparable to changing the order in which the underlying data was ingested.)&lt;/p&gt;
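To see why rollup is cheap and exact for sketches with fixed bucket boundaries, and why t-digest’s data-dependent buckets can introduce small differences, consider merging two toy sketches (the `rollup` name here is illustrative, not the hyperfunction itself):

```python
from collections import Counter

# With fixed bucket boundaries, merging two sketches is just element-wise
# addition of counts, and the result equals the sketch you'd get from all
# the raw data at once. (t-digest bucket boundaries depend on the data each
# digest has seen, which is why its rollup can differ slightly from a
# direct aggregation over the underlying data.)

def rollup(sketch_a, sketch_b):
    """Merge two {bucket_index: count} sketches."""
    return dict(Counter(sketch_a) + Counter(sketch_b))

hour_1 = {0: 5, 1: 12, 2: 3}
hour_2 = {1: 4, 2: 9, 3: 1}
merged = rollup(hour_1, hour_2)  # counts add bucket by bucket
```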

&lt;p&gt;We’ve provided a few of the tradeoffs and differences between the algorithms here, but we have a &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=percentile-approximation-2021&amp;amp;utm_content=docs-advanced-percentile-aggregation#choosing-the-right-algorithm-for-your-use-case" rel="noopener noreferrer"&gt;longer discussion in the docs that can help you choose&lt;/a&gt;. You can also start with the default &lt;code&gt;percentile_agg&lt;/code&gt; and then experiment with different algorithms and parameters on your data to see what works best for your application.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping it up
&lt;/h2&gt;

&lt;p&gt;We’ve provided a brief overview of percentiles: how they can be more informative than common statistical aggregates like the average, why percentile approximations exist, and a bit about how they work in general and within TimescaleDB hyperfunctions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you’d like to get started with the &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=percentile-approximation-2021&amp;amp;utm_content=docs-percentile-approximation" rel="noopener noreferrer"&gt;percentile approximation hyperfunctions&lt;/a&gt; - and many more - right away, spin up a fully managed TimescaleDB service:&lt;/strong&gt; create an account to &lt;a href="https://console.cloud.timescale.com/signup?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=percentile-approximation-2021&amp;amp;utm_content=cloud-signup" rel="noopener noreferrer"&gt;try it for free&lt;/a&gt; for 30 days. (Hyperfunctions are pre-loaded on each new database service on Timescale Cloud, so after you create a new service, you’re all set to use them).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you prefer to manage your own database instances, you can &lt;a href="https://github.com/timescale/timescaledb-toolkit" rel="noopener noreferrer"&gt;download and install the &lt;code&gt;timescaledb_toolkit&lt;/code&gt; extension&lt;/a&gt;&lt;/strong&gt; on GitHub, after which you’ll be able to use percentile approximation and other hyperfunctions.&lt;/p&gt;

&lt;p&gt;We believe time-series data is everywhere, and making sense of it is crucial for all manner of technical problems. We built hyperfunctions to make it easier for developers to harness the power of time-series data.&lt;/p&gt;

&lt;p&gt;We’re always looking for feedback on what to build next and would love to know how you’re using hyperfunctions, problems you want to solve, or things you think should - or could - be simplified to make analyzing time-series data in SQL that much better. (To contribute feedback, comment on an &lt;a href="https://github.com/timescale/timescaledb-toolkit/issues" rel="noopener noreferrer"&gt;open issue&lt;/a&gt; or in a &lt;a href="https://github.com/timescale/timescaledb-toolkit/discussions" rel="noopener noreferrer"&gt;discussion thread&lt;/a&gt; in GitHub.)&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>datascience</category>
      <category>opensource</category>
      <category>database</category>
    </item>
    <item>
      <title>Function pipelines: Building functional programming into PostgreSQL using custom operators</title>
      <dc:creator>davidkohn88</dc:creator>
      <pubDate>Wed, 24 Nov 2021 21:29:43 +0000</pubDate>
      <link>https://dev.to/tigerdata/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators-4e4n</link>
      <guid>https://dev.to/tigerdata/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators-4e4n</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Function pipelines: why are they useful?&lt;/li&gt;
&lt;li&gt;How we built function pipelines without forking PostgreSQL&lt;/li&gt;
&lt;li&gt;A custom data type: the &lt;code&gt;timevector&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A custom operator: &lt;code&gt;-&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Custom functions: pipeline elements&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timevector&lt;/code&gt; transforms&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timevector&lt;/code&gt; finalizers&lt;/li&gt;
&lt;li&gt;Aggregate accessors and mutators&lt;/li&gt;
&lt;li&gt;Next steps&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;We are announcing function pipelines, a new capability that introduces functional programming concepts inside PostgreSQL (and SQL) using custom operators.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Function pipelines&lt;/em&gt; radically improve the developer ergonomics of analyzing data in PostgreSQL and SQL, by applying principles from &lt;a href="https://en.wikipedia.org/wiki/Functional_programming" rel="noopener noreferrer"&gt;functional programming&lt;/a&gt; and popular tools like Python’s &lt;a href="https://pandas.pydata.org/docs/index.html" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt; and &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/basics/" rel="noopener noreferrer"&gt;PromQL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At Timescale our mission is to serve developers worldwide, and enable them to build exceptional data-driven products that measure everything that matters: e.g., software applications, industrial equipment, financial markets, blockchain activity, user actions, consumer behavior, machine learning models, climate change, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.timescale.com/blog/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=blog-sql-beating-no-sql" rel="noopener noreferrer"&gt;We believe SQL is the best language for data analysis&lt;/a&gt;. We’ve championed the benefits of SQL for several years, even back when many were abandoning the language for custom domain-specific languages. And we were right - SQL has resurged and become the universal language for data analysis, and now many NoSQL databases are adding SQL interfaces to keep up.&lt;/p&gt;

&lt;p&gt;But SQL is not perfect, and at times can get quite unwieldy. For example,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abs_delta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;volatility&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="k"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;abs_delta&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;calc_delta&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Pop quiz: What does this query do?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Even if you are a SQL expert, queries like this can be quite difficult to read - and even harder to express. Complex data analysis in SQL can be hard.&lt;/p&gt;

&lt;p&gt;Function pipelines let you express that same query like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abs_delta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;volatility&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="k"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;abs_delta&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;calc_delta&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it is much clearer what this query is doing. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gets the last day’s data from the measurements table, grouped by &lt;code&gt;device_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sorts the data by the time column&lt;/li&gt;
&lt;li&gt;Calculates the delta (or change) between values&lt;/li&gt;
&lt;li&gt;Takes the absolute value of the delta&lt;/li&gt;
&lt;li&gt;And then takes the sum of the result of the previous steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Function pipelines improve your own coding productivity, while also making your SQL code easier for others to comprehend and maintain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inspired by functional programming languages, function pipelines enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL.&lt;/p&gt;

&lt;p&gt;And the best part: we built function pipelines in a way that is fully PostgreSQL compliant - we did not change any SQL syntax - meaning that any tool that speaks PostgreSQL will be able to support data analysis using function pipelines.&lt;/p&gt;

&lt;p&gt;How did we build this? By taking advantage of the incredible extensibility of PostgreSQL, in particular: &lt;a href="https://www.postgresql.org/docs/current/sql-createtype.html" rel="noopener noreferrer"&gt;custom types&lt;/a&gt;, &lt;a href="https://www.postgresql.org/docs/current/sql-createoperator.html" rel="noopener noreferrer"&gt;custom operators&lt;/a&gt;, and &lt;a href="https://www.postgresql.org/docs/current/sql-createfunction.html" rel="noopener noreferrer"&gt;custom functions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In our previous example, you can see the key elements of function pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom data types:&lt;/strong&gt; in this case, the &lt;code&gt;timevector&lt;/code&gt;, which is a set of &lt;code&gt;(time, value)&lt;/code&gt; pairs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom operator:&lt;/strong&gt; &lt;code&gt;-&amp;gt;&lt;/code&gt;, used to &lt;em&gt;compose&lt;/em&gt; and &lt;em&gt;apply&lt;/em&gt; function pipeline elements to the data that comes in.&lt;/li&gt;
&lt;li&gt;And finally, &lt;strong&gt;custom functions:&lt;/strong&gt; called pipeline elements. Pipeline elements can transform and analyze &lt;code&gt;timevector&lt;/code&gt;s (or other data types) in a function pipeline. For this initial release, we’ve built 60 custom functions! &lt;strong&gt;(&lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=docs-how-to-hyperfunctions" rel="noopener noreferrer"&gt;Full list here&lt;/a&gt;)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll go into more detail on function pipelines in the rest of this post, but if you just want to get started as soon as possible, the &lt;strong&gt;easiest way to try function pipelines is through a fully managed Timescale Cloud service&lt;/strong&gt;. &lt;a href="https://console.cloud.timescale.com/signup?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=cloud-signup" rel="noopener noreferrer"&gt;Try it for free&lt;/a&gt; (no credit card required) for 30 days.&lt;/p&gt;

&lt;p&gt;Function pipelines are pre-loaded on each new database service on Timescale Cloud, available immediately - so after you’ve created a new service, you’re all set to use them!&lt;/p&gt;

&lt;p&gt;If you prefer to manage your own database instances, &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/install-toolkit/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=docs-hyperfunctions" rel="noopener noreferrer"&gt;you can install the &lt;code&gt;timescaledb_toolkit&lt;/code&gt;&lt;/a&gt; into your existing PostgreSQL installation, completely for free.&lt;/p&gt;

&lt;p&gt;We’ve been working on this capability for a long time, but in line with our belief of “&lt;a href="https://blog.timescale.com/blog/move-fast-but-dont-break-things-introducing-the-experimental-schema-with-new-experimental-features-in-timescaledb-2-4/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=blog-introducing-experimental-features" rel="noopener noreferrer"&gt;move fast but don’t break things&lt;/a&gt;”, we’re initially releasing function pipelines as an &lt;a href="https://docs.timescale.com/api/latest/api-tag-overview/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=docs-experimental-feature#experimental-timescaledb-toolkit" rel="noopener noreferrer"&gt;experimental feature&lt;/a&gt; - and we would absolutely love to &lt;strong&gt;get your feedback&lt;/strong&gt;. You can &lt;a href="https://github.com/timescale/timescaledb-toolkit/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt; or join a &lt;a href="https://github.com/timescale/timescaledb-toolkit/discussions" rel="noopener noreferrer"&gt;discussion thread&lt;/a&gt; in GitHub (And, if you like what you see, GitHub ⭐ are always welcome and appreciated too!).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We’d also like to take this opportunity to give a huge shoutout to &lt;code&gt;pgx&lt;/code&gt;, &lt;a href="https://github.com/zombodb/pgx" rel="noopener noreferrer"&gt;the Rust-based framework for building PostgreSQL extensions&lt;/a&gt;; it handles a lot of the heavy lifting for this project. We have over 600 custom types, operators, and functions in the &lt;code&gt;timescaledb_toolkit&lt;/code&gt; extension at this point; managing this without &lt;code&gt;pgx&lt;/code&gt; (and the ease of use that comes from working with Rust) would be a real bear of a job.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Function pipelines: why are they useful?
&lt;/h2&gt;

&lt;p&gt;In the Northern hemisphere (where most of Team Timescale sits, including your authors), it is starting to get cold at this time of the year.&lt;/p&gt;

&lt;p&gt;Now imagine a restaurant in New York City whose owners care about their customers and their customers’ comfort. You are working on an IoT product designed to help small businesses like this one minimize their heating bill while maximizing their customers’ happiness. So you install two thermometers: one at the front, measuring the temperature right by the door, and another at the back of the restaurant.&lt;/p&gt;

&lt;p&gt;Now, as many of you may know (if you’ve ever had to sit by the door of a restaurant in the fall or winter), when someone enters, the temperature drops - and once the door is closed, the temperature warms back up. The temperature at the back of the restaurant will vary much less than at the front, right by the door. And both of them will drop slowly down to a lower setpoint during non-business hours and warm back up sometime before business hours, based on the setpoints on our thermostat. So overall we’ll end up with a graph that looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwynl4d3hm8orykb2apkl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwynl4d3hm8orykb2apkl.jpg" alt="A graph of the temperature at the front (near the door) and back. The back is much steadier, while the front is more volatile. Graph is for illustrative purposes only, data is fabricated. No restaurants or restaurant patrons were harmed in the making of this post." width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;
A graph of the temperature at the front (near the door) and back. The back is much steadier, while the front is more volatile. Graph is for illustrative purposes only, data is fabricated. No restaurants or restaurant patrons were harmed in the making of this post.




&lt;p&gt;&lt;br&gt;
As we can see, the temperature by the front door varies much more than at the back of the restaurant. Another way to say this is that the temperature by the front door is more &lt;em&gt;volatile&lt;/em&gt;. The owners of this restaurant want to measure this because frequent temperature changes mean uncomfortable customers.&lt;/p&gt;

&lt;p&gt;In order to measure volatility, we could first subtract each point from the point before it to calculate a delta. If we add these deltas up directly, large positive and negative deltas will cancel out. But we only care about the magnitude of each delta, not its sign - so what we really should do is take the absolute value of each delta, and then sum those absolute values.&lt;/p&gt;
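&lt;p&gt;&lt;em&gt;To make the arithmetic concrete, here is a quick sketch in Python (the temperature readings are made up for illustration):&lt;/em&gt;&lt;/p&gt;

```python
# Volatility = sum of the absolute deltas between consecutive readings.
# Hypothetical temperature readings near the door, in time order.
readings = [68.0, 65.0, 67.0, 66.0]

# Delta between each reading and the one before it.
deltas = [b - a for a, b in zip(readings, readings[1:])]  # [-3.0, 2.0, -1.0]

# Summing the raw deltas would let swings cancel out (-3 + 2 - 1 = -2),
# so we sum absolute values instead.
volatility = sum(abs(d) for d in deltas)  # 3 + 2 + 1 = 6.0
print(volatility)  # 6.0
```

&lt;p&gt;A large value means the temperature swung around a lot, regardless of direction.&lt;/p&gt;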

&lt;p&gt;We now have a metric that might help us measure customer comfort, and also the efficacy of different weatherproofing methods (for example, adding one of those little vestibules that acts as a windbreak).&lt;/p&gt;

&lt;p&gt;To track this, we collect measurements from our thermometers and store them in a table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;device_id&lt;/code&gt; identifies the thermometer, &lt;code&gt;ts&lt;/code&gt; the time of the reading, and &lt;code&gt;val&lt;/code&gt; the temperature.&lt;/p&gt;

&lt;p&gt;Using the data in our measurements table, let’s look at how we calculate volatility using function pipelines.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Because all of the function pipeline features are still experimental, they exist in the &lt;code&gt;toolkit_experimental&lt;/code&gt; schema. Before running any of the SQL code in this post, you will need to set your &lt;code&gt;search_path&lt;/code&gt; to include the experimental schema, as we do in the example below; we won’t repeat this throughout the post so as not to distract.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;search_path&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;toolkit_experimental&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;--still experimental, so do this to make it easier to read&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;timevector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;volatility&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now we have the same query that we used as our example in the introduction.&lt;/p&gt;

&lt;p&gt;In this query, the function pipeline&lt;br&gt;
&lt;code&gt;timevector(ts, val) -&amp;gt; sort() -&amp;gt; delta() -&amp;gt; abs() -&amp;gt; sum()&lt;/code&gt; succinctly expresses the following operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create &lt;code&gt;timevector&lt;/code&gt;s (more detail on this later) out of the &lt;code&gt;ts&lt;/code&gt; and &lt;code&gt;val&lt;/code&gt; columns&lt;/li&gt;
&lt;li&gt;Sort each &lt;code&gt;timevector&lt;/code&gt; by the time column&lt;/li&gt;
&lt;li&gt;Calculate the delta (or change) between each pair in the &lt;code&gt;timevector&lt;/code&gt; by subtracting the previous &lt;code&gt;val&lt;/code&gt; from the current&lt;/li&gt;
&lt;li&gt;Take the absolute value of the delta&lt;/li&gt;
&lt;li&gt;Take the sum of the result from the previous steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;FROM&lt;/code&gt;, &lt;code&gt;WHERE&lt;/code&gt; and &lt;code&gt;GROUP BY&lt;/code&gt; clauses do the rest of the work telling us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We’re getting data &lt;em&gt;FROM&lt;/em&gt; the &lt;code&gt;measurements&lt;/code&gt; table&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;WHERE&lt;/em&gt; the &lt;code&gt;ts&lt;/code&gt;, or timestamp column, contains values over the last day&lt;/li&gt;
&lt;li&gt;Showing one pipeline output per &lt;code&gt;device_id&lt;/code&gt; (the &lt;em&gt;GROUP BY&lt;/em&gt; column)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As we noted before, if you were to do this same calculation using SQL and PostgreSQL functionality, your query would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abs_delta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;volatility&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; 
        &lt;span class="k"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; 
            &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;abs_delta&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;calc_delta&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This does the same five steps as the pipeline above, but it is much harder to understand: we have to use a &lt;a href="https://www.postgresql.org/docs/current/functions-window.html" rel="noopener noreferrer"&gt;window function&lt;/a&gt; and aggregate the results - and because aggregates are computed before window functions, we need to execute the window function in a subquery.&lt;/p&gt;

&lt;p&gt;As we can see, function pipelines make it significantly easier to comprehend the overall analysis of our data. There’s no need to completely understand what’s going on in these functions just yet; for now, it’s enough to know that we’ve essentially implemented a small functional programming language inside of PostgreSQL. You can still use all of the normal, expressive SQL you’ve come to know and love. Function pipelines just add new tools to your SQL toolbox that make it easier to work with time-series data.&lt;/p&gt;

&lt;p&gt;Some avid SQL users might find the syntax a bit foreign at first, but for many people who work in other programming languages, especially using tools like &lt;a href="https://pandas.pydata.org/docs/index.html" rel="noopener noreferrer"&gt;Python’s Pandas Package&lt;/a&gt;, this type of &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html" rel="noopener noreferrer"&gt;successive operation on data sets&lt;/a&gt; will feel natural.&lt;/p&gt;
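&lt;p&gt;&lt;em&gt;For readers who know pandas, here is roughly how the same volatility calculation might look as a method chain (the data is hypothetical; the column names simply mirror the measurements table):&lt;/em&gt;&lt;/p&gt;

```python
import pandas as pd

# Hypothetical readings for two devices, mirroring the measurements table.
df = pd.DataFrame({
    "device_id": [1, 1, 1, 2, 2, 2],
    "ts": pd.to_datetime([
        "2021-11-01 09:00", "2021-11-01 09:05", "2021-11-01 09:10",
        "2021-11-01 09:00", "2021-11-01 09:05", "2021-11-01 09:10",
    ]),
    "val": [68.0, 65.0, 67.0, 70.0, 70.5, 70.0],
})

# sort, per-device delta, absolute value, sum: one step per method call,
# much like a function pipeline.
volatility = (
    df.sort_values("ts")
      .groupby("device_id")["val"]
      .apply(lambda s: s.diff().abs().sum())
)
print(volatility)  # device 1: 5.0, device 2: 1.0
```

&lt;p&gt;Each method call consumes the result of the one before it, which is exactly the mental model function pipelines bring to SQL.&lt;/p&gt;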

&lt;p&gt;And again, this is still fully PostgreSQL compliant: We introduce no changes to the parser or anything that should break compatibility with PostgreSQL drivers.&lt;/p&gt;




&lt;h2&gt;
  
  
  How we built function pipelines without forking PostgreSQL
&lt;/h2&gt;

&lt;p&gt;We built function pipelines (without modifying the &lt;a href="https://www.postgresql.org/docs/10/parser-stage.html" rel="noopener noreferrer"&gt;parser&lt;/a&gt; or anything else that would require a fork of PostgreSQL) by taking advantage of three of the many ways that PostgreSQL enables extensibility: &lt;a href="https://www.postgresql.org/docs/current/sql-createtype.html" rel="noopener noreferrer"&gt;custom types&lt;/a&gt;, &lt;a href="https://www.postgresql.org/docs/current/sql-createfunction.html" rel="noopener noreferrer"&gt;custom functions&lt;/a&gt;, and &lt;a href="https://www.postgresql.org/docs/current/sql-createoperator.html" rel="noopener noreferrer"&gt;custom operators&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom data types&lt;/strong&gt;, starting with the &lt;code&gt;timevector&lt;/code&gt;, which is a set of &lt;code&gt;(time, value)&lt;/code&gt; pairs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A custom operator&lt;/strong&gt;: &lt;code&gt;-&amp;gt;&lt;/code&gt;, which is used to &lt;em&gt;compose&lt;/em&gt; and &lt;em&gt;apply&lt;/em&gt; function pipeline elements to the data that comes in.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom functions&lt;/strong&gt;, called &lt;em&gt;pipeline elements&lt;/em&gt;, which can transform and analyze &lt;code&gt;timevectors&lt;/code&gt; (or other data types) in a function pipeline (with 60 functions in this initial release)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We believe that new idioms like these are exactly what PostgreSQL was meant to enable. That’s why it has supported custom types, functions and operators from its earliest days. (And is one of the many reasons why we love PostgreSQL.)&lt;/p&gt;




&lt;h2&gt;
  
  
  A custom data type: the timevector
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;timevector&lt;/code&gt; is a collection of &lt;code&gt;(time, value)&lt;/code&gt; pairs. As of now, the times must be &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;s and the values must be &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; numbers. (But this may change in the future as we continue to develop this data type. If you have ideas/input, please &lt;a href="https://github.com/timescale/timescaledb-toolkit/issues" rel="noopener noreferrer"&gt;file feature requests on GitHub&lt;/a&gt; explaining what you’d like!)&lt;/p&gt;

&lt;p&gt;You can think of the &lt;code&gt;timevector&lt;/code&gt; as something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvr7whthoombtlcjl4my.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvr7whthoombtlcjl4my.jpg" alt="A depiction of a  raw `timevector` endraw ." width="800" height="741"&gt;&lt;/a&gt;&lt;/p&gt;
A depiction of a `timevector`.




&lt;p&gt;&lt;br&gt;
One of the first questions you might ask is: how does a &lt;code&gt;timevector&lt;/code&gt; relate to time-series data? (If you want to know more about time-series data, we have a &lt;a href="https://blog.timescale.com/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=blog-what-the-heck-is-time-series-data" rel="noopener noreferrer"&gt;great blog post on that&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;Let’s consider our example from above, where we were talking about a restaurant that was measuring temperatures, and we had a &lt;code&gt;measurements&lt;/code&gt; table like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we can think of a single time-series dataset as all historical and future time and temperature measurements from a device. &lt;/p&gt;

&lt;p&gt;Given this definition, we can think of a &lt;code&gt;timevector&lt;/code&gt; as a &lt;strong&gt;finite subset of a time-series dataset&lt;/strong&gt;. The larger time-series dataset may extend back into the past and it may extend into the future, but the &lt;code&gt;timevector&lt;/code&gt; is bounded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2hj19hq3of6p5nsblyz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2hj19hq3of6p5nsblyz.jpg" alt="A  raw `timevector` endraw  is a finite subset of a time-series and contains all the  raw `(time, value)` endraw  pairs in some region of the time-series." width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;
A `timevector` is a finite subset of a time-series and contains all the `(time, value)` pairs in some region of the time-series.




&lt;p&gt;&lt;br&gt;
In order to construct a &lt;code&gt;timevector&lt;/code&gt;  from the data gathered from a thermometer, we use a custom aggregate and pass in the columns we want to become our &lt;code&gt;(time, value)&lt;/code&gt; pairs. We can use the &lt;code&gt;WHERE&lt;/code&gt; clause to define the extent of the &lt;code&gt;timevector&lt;/code&gt; (i.e., the limits of this subset), and the &lt;code&gt;GROUP BY&lt;/code&gt; clause to provide identifying information about the time-series that’s represented. &lt;/p&gt;

&lt;p&gt;Building on our example, this is how we construct a &lt;code&gt;timevector&lt;/code&gt; for each thermometer in our dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;timevector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But a &lt;code&gt;timevector&lt;/code&gt; doesn't provide much value by itself. So now, let’s also consider some complex calculations that we can apply to the &lt;code&gt;timevector&lt;/code&gt;, starting with a custom operator used to apply these functions.&lt;/p&gt;




&lt;h2&gt;
  
  
  A custom operator: -&amp;gt;
&lt;/h2&gt;

&lt;p&gt;In function pipelines, the &lt;code&gt;-&amp;gt;&lt;/code&gt; operator is used to apply and compose multiple functions in an easy-to-write, easy-to-read format. &lt;/p&gt;

&lt;p&gt;Fundamentally, &lt;code&gt;-&amp;gt;&lt;/code&gt; means: “apply the operation on the right to the inputs on the left”, or, more simply “do the next thing”. &lt;/p&gt;

&lt;p&gt;We created a general-purpose operator for this because we think that too many operators meaning different things can get very confusing and difficult to read.&lt;/p&gt;

&lt;p&gt;One thing that you’ll notice about the pipeline elements is that the arguments are in an unusual place in a statement like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="n"&gt;timevector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;volatility&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It appears (from the semantics) that the &lt;code&gt;timevector(ts, val)&lt;/code&gt; is an argument to &lt;code&gt;sort()&lt;/code&gt;, the resulting &lt;code&gt;timevector&lt;/code&gt; is an argument to &lt;code&gt;delta()&lt;/code&gt; and so on. &lt;/p&gt;

&lt;p&gt;The thing is that &lt;code&gt;sort()&lt;/code&gt; (and the others) are regular function calls; they can’t see anything outside of their parentheses and don’t know about anything to their left in the statement; so we need a way to get the &lt;code&gt;timevector&lt;/code&gt; into the &lt;code&gt;sort()&lt;/code&gt; (and the rest of the pipeline). &lt;/p&gt;

&lt;p&gt;The way we solved this is by taking advantage of one of the same fundamental computing insights that functional programming languages use: code and data are really the same thing. &lt;/p&gt;

&lt;p&gt;Each of our functions returns a special type that describes the function and its arguments. We call these types pipeline elements (more on them later). &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;-&amp;gt;&lt;/code&gt; operator then performs one of two different types of actions depending on the types on its right and left sides.  It can either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Apply&lt;/em&gt; a pipeline element to the left hand argument - perform the function described by the pipeline element on the incoming data type directly.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Compose&lt;/em&gt; pipeline elements into a combined element that can be applied at some point in the future (this is an optimization that allows us to apply multiple elements in a “nested” manner so that we don’t perform multiple unnecessary passes).&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;Let’s look at our &lt;code&gt;timevector&lt;/code&gt; pipeline from before: &lt;code&gt;timevector(ts, val) -&amp;gt; sort() -&amp;gt; delta() -&amp;gt; abs() -&amp;gt; sum()&lt;/code&gt;. As I noted earlier, this function pipeline performs the following steps: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create &lt;code&gt;timevector&lt;/code&gt;s out of the &lt;code&gt;ts&lt;/code&gt; and &lt;code&gt;val&lt;/code&gt; columns&lt;/li&gt;
&lt;li&gt;Sort it by the time column &lt;/li&gt;
&lt;li&gt;Calculate the delta (or change) between each pair in the &lt;code&gt;timevector&lt;/code&gt; by subtracting the previous &lt;code&gt;val&lt;/code&gt; from the current&lt;/li&gt;
&lt;li&gt;Take the absolute value of the delta&lt;/li&gt;
&lt;li&gt;Take the sum of the result from the previous steps&lt;/li&gt;
&lt;/ol&gt;
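
&lt;p&gt;For comparison, the same volatility calculation can be written in plain SQL with window functions. This is just a sketch, assuming the &lt;code&gt;measurements&lt;/code&gt; table from the earlier examples:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT device_id, sum(abs(delta)) AS volatility
FROM (
    SELECT device_id,
           val - lag(val) OVER (PARTITION BY device_id ORDER BY ts) AS delta
    FROM measurements
    WHERE ts &amp;gt;= now() - '1 day'::interval
) diffs
GROUP BY device_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The pipeline version expresses the same five steps left to right, without the subquery.&lt;/p&gt;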

&lt;p&gt;And logically, at each step, we can think of the &lt;code&gt;timevector&lt;/code&gt; being materialized and passed to the next step in the pipeline. &lt;/p&gt;

&lt;p&gt;However, while this will produce a correct result, it’s not the most efficient way to compute this. Instead, it would be more efficient to compute as much as possible in a single pass over the data. &lt;/p&gt;

&lt;p&gt;In order to do this, we allow not only the apply operation but also the compose operation. Once we’ve composed a pipeline into a logically equivalent higher-order pipeline containing all of the elements, we can choose the most efficient way to execute it internally. (Importantly, even if we have to perform each step sequentially, we don’t need to materialize the &lt;code&gt;timevector&lt;/code&gt; and pass it between steps, so there is significantly less overhead even without other optimizations.) &lt;/p&gt;




&lt;h2&gt;
  
  
  Custom functions: pipeline elements&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Now let’s discuss the third, and final, key piece that makes up function pipelines: custom functions, or as we call them, pipeline elements.&lt;/p&gt;

&lt;p&gt;We have implemented over 60 individual pipeline elements, which fall into 4 categories (with a few subcategories):&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;timevector&lt;/code&gt; transforms&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;These elements take in a &lt;code&gt;timevector&lt;/code&gt; and produce a &lt;code&gt;timevector&lt;/code&gt;. They are the easiest to compose, as they produce the same type they consume.&lt;/p&gt;

&lt;p&gt;Example pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;timevector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lttb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Organized by sub-category:&lt;/p&gt;

&lt;h3&gt;
  
  
  Unary mathematical
&lt;/h3&gt;

&lt;p&gt;Simple mathematical functions applied to the value in each point in a &lt;code&gt;timevector&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;abs()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the absolute value of each value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cbrt()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the cube root of each value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ceil()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the first integer greater than or equal to each value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;floor()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the first integer less than or equal to each value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ln()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the natural logarithm of each value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;log10()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the base 10 logarithm of each value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;round()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the closest integer to each value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sign()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes +/-1 for each positive/negative value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sqrt()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the square root for each value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;trunc()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes only the integer portion of each value&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
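
&lt;p&gt;Because each unary element returns a &lt;code&gt;timevector&lt;/code&gt;, they chain freely. A sketch (not from the examples above) that takes the absolute value of each point and then its square root:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT device_id,
       timevector(ts, val) -&amp;gt; abs() -&amp;gt; sqrt() -&amp;gt; unnest()
FROM measurements
GROUP BY device_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;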

&lt;h3&gt;
  
  
  Binary mathematical
&lt;/h3&gt;

&lt;p&gt;Simple mathematical functions with a scalar input applied to the value in each point in a &lt;code&gt;timevector&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;add(N)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes each value plus &lt;code&gt;N&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;div(N)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes each value divided by &lt;code&gt;N&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;logn(N)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the logarithm base &lt;code&gt;N&lt;/code&gt; of each value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mod(N)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the remainder when each number is divided by &lt;code&gt;N&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mul(N)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes each value multiplied by &lt;code&gt;N&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;power(N)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes each value taken to the &lt;code&gt;N&lt;/code&gt; power&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sub(N)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes each value less &lt;code&gt;N&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
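
&lt;p&gt;As a sketch, assuming &lt;code&gt;val&lt;/code&gt; stores a temperature in degrees Celsius, &lt;code&gt;mul&lt;/code&gt; and &lt;code&gt;add&lt;/code&gt; convert each point to Fahrenheit:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT device_id,
       timevector(ts, val) -&amp;gt; mul(1.8) -&amp;gt; add(32.0) -&amp;gt; unnest()
FROM measurements
GROUP BY device_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;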

&lt;h3&gt;
  
  
  Compound transforms
&lt;/h3&gt;

&lt;p&gt;Transforms involving multiple points inside of a &lt;code&gt;timevector&lt;/code&gt; - &lt;em&gt;&lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/#compound-transforms" rel="noopener noreferrer"&gt;see here for more information&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;delta()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the difference between each value and the previous one&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fill_to(interval, fill_method)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fills gaps larger than &lt;code&gt;interval&lt;/code&gt; with points at &lt;code&gt;interval&lt;/code&gt; from the previous using &lt;code&gt;fill_method&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;lttb(resolution)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Downsamples a &lt;code&gt;timevector&lt;/code&gt; using the largest triangle three buckets algorithm at &lt;code&gt;resolution&lt;/code&gt;; requires sorted input.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sort()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sorts the &lt;code&gt;timevector&lt;/code&gt; by the &lt;code&gt;time&lt;/code&gt; column ascending&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Lambda elements
&lt;/h3&gt;

&lt;p&gt;These elements use lambda expressions, which let you write small functions that are evaluated over each point in a &lt;code&gt;timevector&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Lambda expressions can return a &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; value like &lt;code&gt;$$ $value^2 + $value + 3 $$&lt;/code&gt;. They can return a &lt;code&gt;BOOL&lt;/code&gt; like &lt;code&gt;$$ $time &amp;gt; '2020-01-01't $$&lt;/code&gt;. They can also return a &lt;code&gt;(time, value)&lt;/code&gt; pair like &lt;code&gt;$$ ($time + '1 day'i, sin($value) * 4) $$&lt;/code&gt;. You can apply them using the elements below:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;filter(lambda (bool) )&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Removes points from the &lt;code&gt;timevector&lt;/code&gt; where the lambda expression evaluates to &lt;code&gt;false&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;map(lambda (value) )&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Applies the lambda expression to all the values in the &lt;code&gt;timevector&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;map(lambda (time, value) )&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Applies the lambda expression to all the times and values in the &lt;code&gt;timevector&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
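
&lt;p&gt;For example, a sketch that uses a &lt;code&gt;BOOL&lt;/code&gt; lambda with &lt;code&gt;filter()&lt;/code&gt; to keep only positive values before summing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT device_id,
       timevector(ts, val) -&amp;gt; filter($$ $value &amp;gt; 0.0 $$) -&amp;gt; sum()
FROM measurements
GROUP BY device_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;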




&lt;h2&gt;
  
  
  &lt;code&gt;timevector&lt;/code&gt; finalizers&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;These elements end the &lt;code&gt;timevector&lt;/code&gt; portion of a pipeline; they can either help with output or produce an aggregate over the entire &lt;code&gt;timevector&lt;/code&gt;. They act as an optimization barrier to composition because they (usually) produce types other than &lt;code&gt;timevector&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Example pipelines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;timevector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;unnest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;timevector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;time_weight&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finalizer pipeline elements organized by sub-category:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;timevector&lt;/code&gt; output
&lt;/h3&gt;

&lt;p&gt;These elements help with output and can produce a set of &lt;code&gt;(time, value)&lt;/code&gt; pairs or a materialized &lt;code&gt;timevector&lt;/code&gt;. Note: this is an area where we’d love further feedback. Are there particular data formats, say for graphing, that would be especially useful for us to add? &lt;a href="https://github.com/timescale/timescaledb-toolkit/issues" rel="noopener noreferrer"&gt;File an issue in our GitHub!&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unnest( )&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Produces a set of &lt;code&gt;(time, value)&lt;/code&gt; pairs. You can wrap and expand as a composite type to produce separate columns &lt;code&gt;(pipe -&amp;gt; unnest()).*&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;materialize()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Materializes a &lt;code&gt;timevector&lt;/code&gt; to pass directly to an application or another operation; blocks any optimizations that would materialize it lazily.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
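
&lt;p&gt;To get separate &lt;code&gt;time&lt;/code&gt; and &lt;code&gt;value&lt;/code&gt; columns back out of &lt;code&gt;unnest()&lt;/code&gt;, the result can be expanded as a composite type; a sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT device_id, (tv -&amp;gt; unnest()).*
FROM (
    SELECT device_id, timevector(ts, val) AS tv
    FROM measurements
    GROUP BY device_id
) t;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;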

&lt;h3&gt;
  
  
  &lt;code&gt;timevector&lt;/code&gt; aggregates
&lt;/h3&gt;

&lt;p&gt;Aggregate all the points in a &lt;code&gt;timevector&lt;/code&gt; to produce a single value as a result. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;average()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the average of the values in the &lt;code&gt;timevector&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;counter_agg()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the counter_agg aggregate over the times and values in the &lt;code&gt;timevector&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stats_agg()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes a range of statistical aggregates and returns a &lt;code&gt;1DStatsAgg&lt;/code&gt; over the values in the &lt;code&gt;timevector&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sum()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the sum of the values in the &lt;code&gt;timevector&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;num_vals()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Counts the points in the &lt;code&gt;timevector&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
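
&lt;p&gt;A sketch computing two of these aggregates from the same &lt;code&gt;timevector&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT device_id,
       timevector(ts, val) -&amp;gt; average() AS avg_val,
       timevector(ts, val) -&amp;gt; num_vals() AS n_points
FROM measurements
GROUP BY device_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;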




&lt;h2&gt;
  
  
  Aggregate accessors and mutators&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;These function pipeline elements act like the accessors that I described in our &lt;a href="https://blog.timescale.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/" rel="noopener noreferrer"&gt;previous post on aggregates&lt;/a&gt;. You can use them to get a value from the aggregate part of a function pipeline like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;timevector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;stats_agg&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;variance&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But these don’t just work on &lt;code&gt;timevector&lt;/code&gt;s; they also work on normally produced aggregates. &lt;/p&gt;
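
&lt;p&gt;For example, a sketch applying the &lt;code&gt;average()&lt;/code&gt; accessor to an ordinary &lt;code&gt;stats_agg&lt;/code&gt; aggregate, with no &lt;code&gt;timevector&lt;/code&gt; involved:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT device_id, stats_agg(val) -&amp;gt; average()
FROM measurements
GROUP BY device_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;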

&lt;p&gt;When used instead of the normal function accessors and mutators, they can make the syntax clearer by eliminating nested function calls like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead, we can use the arrow accessor to convey the same thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;percentile_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;approx_percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By aggregate family:&lt;/p&gt;

&lt;h3&gt;
  
  
  Counter aggregates
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/counter-aggregation/counter-aggs/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=docs-hyperfunctions-counter-aggregates#counter-aggregates" rel="noopener noreferrer"&gt;Counter aggregates&lt;/a&gt; deal with resetting counters, (and were stabilized in our 1.3 release this week!). Counters are a common type of metric in the application performance monitoring and metrics world. All values have resets accounted for. These elements must have a &lt;code&gt;CounterSummary&lt;/code&gt; to their left when used in a pipeline, from a &lt;code&gt;counter_agg()&lt;/code&gt; aggregate or pipeline element.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;counter_zero_time()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The time at which the counter value is predicted to have been zero based on the least squares fit of the points input to the &lt;code&gt;CounterSummary&lt;/code&gt; (x-intercept)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;corr()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The correlation coefficient of the least squares fit line of the adjusted counter value.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;delta()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the last - first value of the counter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;extrapolated_delta(method)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Computes the delta extrapolated using the provided method to bounds of range. Bounds must have been provided in the aggregate or a &lt;code&gt;with_bounds&lt;/code&gt; call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;idelta_left()&lt;/code&gt; / &lt;code&gt;idelta_right()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Computes the instantaneous difference between the second and first points (left) or last and next-to-last points (right)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;intercept()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The y-intercept of the least squares fit line of the adjusted counter value.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;irate_left()&lt;/code&gt; / &lt;code&gt;irate_right()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Computes the instantaneous rate of change between the second and first points (left) or last and next-to-last points (right)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;num_changes()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number of times the counter changed values.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;num_elements()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number of points; any with the exact same time are counted only once.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;num_resets()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number of times the counter reset.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slope()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The slope of the least squares fit line of the adjusted counter value.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;with_bounds(range)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Applies bounds using the &lt;code&gt;range&lt;/code&gt; (a &lt;code&gt;TSTZRANGE&lt;/code&gt;) to the &lt;code&gt;CounterSummary&lt;/code&gt; if they weren’t provided in the aggregation step&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
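
&lt;p&gt;A sketch computing the reset-adjusted change per device with &lt;code&gt;counter_agg()&lt;/code&gt; and the &lt;code&gt;delta()&lt;/code&gt; accessor:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT device_id, counter_agg(ts, val) -&amp;gt; delta()
FROM measurements
GROUP BY device_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;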

&lt;h3&gt;
  
  
  Percentile approximation
&lt;/h3&gt;

&lt;p&gt;These aggregate accessors deal with &lt;a href="https://blog.timescale.com/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=blog-how-percentile-approximation-works" rel="noopener noreferrer"&gt;percentile approximation&lt;/a&gt;. For now, we’ve implemented them only for &lt;code&gt;percentile_agg&lt;/code&gt; and &lt;code&gt;uddsketch&lt;/code&gt;-based aggregates; we have not yet implemented them for &lt;code&gt;tdigest&lt;/code&gt;. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;approx_percentile(p)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The approximate value at percentile &lt;code&gt;p&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;approx_percentile_rank(v)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The approximate percentile a value &lt;code&gt;v&lt;/code&gt; would fall in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The maximum relative error guaranteed by the approximation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mean()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The exact average of the input values.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;num_vals()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The number of input values&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
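
&lt;p&gt;A sketch reading the approximate median alongside the maximum relative error of the approximation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT percentile_agg(val) -&amp;gt; approx_percentile(0.5) AS median,
       percentile_agg(val) -&amp;gt; error() AS max_rel_error
FROM measurements;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;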

&lt;h3&gt;
  
  
  Statistical aggregates
&lt;/h3&gt;

&lt;p&gt;These aggregate accessors add support for common &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/function-pipelines/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=docs-hyperfunctions-function-pipelines#statistical-aggregates" rel="noopener noreferrer"&gt;statistical aggregates&lt;/a&gt; (and were stabilized in our 1.3 release this week!). They allow you to compute and &lt;code&gt;rollup()&lt;/code&gt; common statistical aggregates like &lt;code&gt;average&lt;/code&gt; and &lt;code&gt;stddev&lt;/code&gt;, more advanced ones like &lt;code&gt;skewness&lt;/code&gt;, and two-dimensional aggregates like &lt;code&gt;slope&lt;/code&gt; and &lt;code&gt;covariance&lt;/code&gt;. Because there are both 1D and 2D versions of these, the accessors can have multiple forms; for instance, &lt;code&gt;average()&lt;/code&gt; calculates the average on a 1D aggregate, while &lt;code&gt;average_y()&lt;/code&gt; and &lt;code&gt;average_x()&lt;/code&gt; do so on each dimension of a 2D aggregate. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;average() / average_y() / average_x()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The average of the values.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;corr()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The correlation coefficient of the least squares fit line.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;covariance(method)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The covariance of the values using either &lt;code&gt;population&lt;/code&gt; or &lt;code&gt;sample&lt;/code&gt; method.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;determination_coeff()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The determination coefficient (aka R squared) of the values.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kurtosis(method) / kurtosis_y(method) / kurtosis_x(method)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The kurtosis (4th moment) of the values using either &lt;code&gt;population&lt;/code&gt; or &lt;code&gt;sample&lt;/code&gt; method.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;intercept()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The intercept of the least squares fit line.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;num_vals()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The number of (non-null) values seen.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sum() / sum_x() / sum_y()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The sum of the values seen.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;skewness(method) / skewness_y(method) / skewness_x(method)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The skewness (3rd moment) of the values using either &lt;code&gt;population&lt;/code&gt; or &lt;code&gt;sample&lt;/code&gt; method.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slope()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The slope of the least squares fit line.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stddev(method) / stddev_y(method) / stddev_x(method)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The standard deviation of the values using either &lt;code&gt;population&lt;/code&gt; or &lt;code&gt;sample&lt;/code&gt; method.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;variance(method) / variance_y(method) / variance_x(method)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The variance of the values using either &lt;code&gt;population&lt;/code&gt; or &lt;code&gt;sample&lt;/code&gt; method.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x_intercept()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The x intercept of the least squares fit line.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
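
&lt;p&gt;A sketch of the 2D form, assuming a hypothetical &lt;code&gt;observations&lt;/code&gt; table with numeric columns &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;x&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT stats_agg(y, x) -&amp;gt; slope() AS slope,
       stats_agg(y, x) -&amp;gt; average_x() AS avg_x
FROM observations;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;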

&lt;h3&gt;
  
  
  Time weighted averages
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://blog.timescale.com/blog/what-time-weighted-averages-are-and-why-you-should-care/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=blog-time-weighted-averages" rel="noopener noreferrer"&gt;&lt;code&gt;average()&lt;/code&gt;&lt;/a&gt; accessor may be called on the output of a &lt;a href="https://blog.timescale.com/blog/what-time-weighted-averages-are-and-why-you-should-care/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=blog-time-weighted-averages" rel="noopener noreferrer"&gt;&lt;code&gt;time_weight()&lt;/code&gt;&lt;/a&gt; like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;time_weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Linear'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Approximate count distinct (Hyperloglog)
&lt;/h3&gt;

&lt;p&gt;This is an &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/approx-count-distincts/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=docs-hyperfunctions-aprox-count-distinct" rel="noopener noreferrer"&gt;approximation for distinct counts&lt;/a&gt; that was stabilized in our 1.3 release! The &lt;code&gt;distinct_count()&lt;/code&gt; accessor may be called on the output of a &lt;code&gt;hyperloglog()&lt;/code&gt; like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;hyperloglog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;distinct_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
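&lt;p&gt;The same accessor pattern composes with other aggregates and grouping. A sketch of daily distinct counts, assuming &lt;code&gt;measurements&lt;/code&gt; has a &lt;code&gt;ts&lt;/code&gt; timestamp column and following the calling convention above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT time_bucket('1 day', ts) AS day,
       hyperloglog(device_id) -&gt; distinct_count()
FROM measurements
GROUP BY day;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;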






&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;We hope this post helped you understand how function pipelines leverage PostgreSQL extensibility to offer functional programming concepts in a way that is fully PostgreSQL-compliant, and how they can improve the ergonomics of your code, making it easier to write, read, and maintain. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://console.cloud.timescale.com/signup?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=function-pipeline-2021&amp;amp;utm_content=cloud-signup" rel="noopener noreferrer"&gt;You can try function pipelines today&lt;/a&gt;&lt;/strong&gt; with a fully-managed Timescale Cloud service (no credit card required, free for 30 days). Function pipelines are available now on every new database service on Timescale Cloud, so after you’ve created a new service, you’re all set to use them!&lt;/p&gt;

&lt;p&gt;If you prefer to manage your own database instances, you can &lt;a href="https://github.com/timescale/timescaledb-toolkit" rel="noopener noreferrer"&gt;download and install the &lt;code&gt;timescaledb_toolkit&lt;/code&gt; extension&lt;/a&gt; on GitHub for free, after which you’ll be able to use function pipelines. &lt;/p&gt;

&lt;p&gt;We love building in public. You can view our &lt;a href="https://github.com/timescale/timescaledb-toolkit" rel="noopener noreferrer"&gt;upcoming roadmap on GitHub&lt;/a&gt; for a list of proposed features, as well as features we’re currently implementing and those that are available to use today. We also welcome feedback from the community (it helps us prioritize the features users really want). To contribute feedback, comment on an &lt;a href="https://github.com/timescale/timescaledb-toolkit/issues" rel="noopener noreferrer"&gt;open issue&lt;/a&gt; or in a &lt;a href="https://github.com/timescale/timescaledb-toolkit/discussions" rel="noopener noreferrer"&gt;discussion thread&lt;/a&gt; in GitHub.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>opensource</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How PostgreSQL aggregation works and how it inspired our hyperfunctions’ design</title>
      <dc:creator>davidkohn88</dc:creator>
      <pubDate>Mon, 30 Aug 2021 09:04:30 +0000</pubDate>
      <link>https://dev.to/tigerdata/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-33k6</link>
      <guid>https://dev.to/tigerdata/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-33k6</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;A primer on PostgreSQL aggregation (through pictures)&lt;/li&gt;
&lt;li&gt;Two-step aggregation in TimescaleDB hyperfunctions&lt;/li&gt;
&lt;li&gt;Why we use the two-step aggregate design pattern&lt;/li&gt;
&lt;li&gt;Two-step aggregation + continuous aggregates in TimescaleDB&lt;/li&gt;
&lt;li&gt;An example of how the two-step aggregate design impacts hyperfunctions’ code&lt;/li&gt;
&lt;li&gt;Summing it up&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Get a primer on PostgreSQL aggregation, how PostgreSQL’s implementation inspired us as we built TimescaleDB hyperfunctions and its integrations with advanced TimescaleDB features – and what this means for developers.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At Timescale, our goal is to always focus on the developer experience, and we take great care to design our products and APIs to be developer-friendly. We believe that when our products are easy to use and accessible to a wide range of developers, we enable them to solve a breadth of different problems – and thus build solutions that solve big problems.&lt;/p&gt;

&lt;p&gt;This focus on developer experience is why we made the decision &lt;a href="https://blog.timescale.com/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=tsdb-beta-blog" rel="noopener noreferrer"&gt;early in the design of TimescaleDB to build on top of PostgreSQL&lt;/a&gt;. We believed then, as we do now, that building on &lt;a href="https://db-engines.com/en/ranking" rel="noopener noreferrer"&gt;the world’s fastest-growing database&lt;/a&gt; would have numerous benefits for our users. &lt;/p&gt;

&lt;p&gt;Perhaps the biggest of these advantages is developer productivity: developers can use the tools and frameworks they know and love and bring all of their SQL skills and expertise. &lt;/p&gt;

&lt;p&gt;Today, there are nearly three million active TimescaleDB databases running mission-critical time-series workloads across industries. Time-series data comes at you fast, sometimes generating millions of data points per second (&lt;a href="https://blog.timescale.com/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=time-series-blog" rel="noopener noreferrer"&gt;read more about time-series data&lt;/a&gt;). Because of this volume and rate of information, time-series data is complex to query and analyze. We built TimescaleDB as a purpose-built relational database for time-series to reduce that complexity so that developers can focus on their applications.&lt;/p&gt;

&lt;p&gt;So, we’re built with developer experience at our core, and we’ve continually released functionality to further this aim, including &lt;a href="https://blog.timescale.com/blog/timescaledb-2-0-a-multi-node-petabyte-scale-completely-free-relational-database-for-time-series/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=tsdb-2.0-blog" rel="noopener noreferrer"&gt;continuous aggregates, user-defined actions, informational views&lt;/a&gt;, and most recently, &lt;a href="https://blog.timescale.com/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=hyperfunctions-blog" rel="noopener noreferrer"&gt;TimescaleDB hyperfunctions&lt;/a&gt;: a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code.&lt;/p&gt;

&lt;p&gt;To ensure we stay focused on developer experience as we plan new hyperfunctions features, we established a set of “design constraints” that guide our development decisions. Adhering to these guidelines ensures our APIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work within the SQL language (no new syntax, just functions and aggregates)&lt;/li&gt;
&lt;li&gt;Are intuitive for new and experienced SQL users&lt;/li&gt;
&lt;li&gt;Are useful for just a few rows of data and performant with billions of rows&lt;/li&gt;
&lt;li&gt;Play nicely with all TimescaleDB features and, ideally, make them &lt;em&gt;more&lt;/em&gt; useful to users&lt;/li&gt;
&lt;li&gt;Make fundamental things simple and more advanced analyses possible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What does this look like in practice? In this post, I explain how these constraints led us to adopt two-step aggregation throughout TimescaleDB hyperfunctions, how two-step aggregates interact with other TimescaleDB features, and how PostgreSQL's internal aggregation API influenced our implementation. &lt;/p&gt;

&lt;p&gt;When we talk about two-step aggregation, we mean the following calling convention:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffk65hi08fowhbu3n8qgi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffk65hi08fowhbu3n8qgi.jpg" alt="code: SELECT average(time_weight('LOCF', value)) as time_weighted_average FROM foo;&amp;lt;br&amp;gt;
-- or&amp;lt;br&amp;gt;
SELECT approx_percentile(0.5, percentile_agg(value)) as median FROM bar;&amp;lt;br&amp;gt;
" width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;
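&lt;p&gt;Rendered as plain SQL (the same statements shown in the image above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT average(time_weight('LOCF', value)) as time_weighted_average FROM foo;
-- or
SELECT approx_percentile(0.5, percentile_agg(value)) as median FROM bar;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;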

&lt;p&gt;Where we have an inner aggregate call:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucyq3dmed7kf24n4rtfq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucyq3dmed7kf24n4rtfq.jpg" alt="The same as the previous in terms of code, except the sections: time_weight('LOCF', value) and percentile_agg(value) are highlighted " width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And an outer accessor call:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9x6h15xut38k0s3b670.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9x6h15xut38k0s3b670.jpg" alt="The same as the previous in terms of code, except the sections: average(time_weight('LOCF', value)) and approx_percentile(0.5, percentile_agg(value)) are highlighted" width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We chose this design pattern over the more common - and seemingly simpler - one-step aggregation approach, in which a single function encapsulates the behavior of both the inner aggregate and outer accessor:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvf9b3qwtnpzrrmorl5f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvf9b3qwtnpzrrmorl5f.jpg" alt="code: -- NB: THIS IS AN EXAMPLE OF AN API WE DECIDED NOT TO USE, IT DOES NOT WORK SELECT time_weighted_average('LOCF', value) as time_weighted_average FROM foo; -- or SELECT approx_percentile(0.5, value) as median FROM bar;" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Read on for more on why the one-step aggregate approach quickly breaks down as you start doing more complex things (like composing functions into more advanced queries) and how, under the hood, almost all PostgreSQL aggregates do a version of two-step aggregation. You’ll learn how the PostgreSQL implementation inspired us as we built TimescaleDB hyperfunctions, continuous aggregates, and other advanced features – and what this means for developers. &lt;/p&gt;

&lt;p&gt;If you’d like to get started with hyperfunctions right away, &lt;a href="https://console.forge.timescale.com/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=signup" rel="noopener noreferrer"&gt;create your free trial account&lt;/a&gt; and start analyzing 🔥. (TimescaleDB hyperfunctions are pre-installed on every Timescale Forge instance, our hosted cloud-native relational time-series data platform).&lt;/p&gt;


&lt;h2&gt;
  
  
  A primer on PostgreSQL aggregation (through pictures)
&lt;/h2&gt;

&lt;p&gt;When I first started learning about PostgreSQL 5 or 6 years ago (I was an electrochemist dealing with lots of battery data, as mentioned in &lt;a href="https://blog.timescale.com/blog/what-time-weighted-averages-are-and-why-you-should-care/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=time-weight-blog" rel="noopener noreferrer"&gt;my last post on time-weighted averages&lt;/a&gt;), I ran into some performance issues. I was trying to better understand what was going on inside the database in order to improve its performance – and that’s when I found &lt;a href="https://momjian.us" rel="noopener noreferrer"&gt;Bruce Momjian&lt;/a&gt;’s talks on &lt;a href="https://momjian.us/main/presentations/internals.html" rel="noopener noreferrer"&gt;PostgreSQL Internals Through Pictures&lt;/a&gt;. Bruce is well known in the community for his insightful talks (and his penchant for bow ties), and his sessions were a revelation for me. &lt;/p&gt;

&lt;p&gt;They’ve served as a foundation for my understanding of how PostgreSQL works ever since. He explained things so clearly, and I’ve always learned best when I can visualize what’s going on, so the “through pictures” part really helped - and stuck with - me. &lt;/p&gt;

&lt;p&gt;So this next bit is my attempt to channel Bruce by explaining some PostgreSQL internals through pictures. Cinch up your bow ties and get ready for some learnin’.&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwckpfux0elhtialp4f1.gif" alt="A GIF of the author finishing tying a bow tie and fixing his shirt." width="600" height="337"&gt;The author pays homage to Bruce Momjian (and looks rather pleased with himself because he’s managed to tie a bow tie on the first try).





&lt;h3&gt;
  
  
  PostgreSQL aggregates vs. functions
&lt;/h3&gt;

&lt;p&gt;We have written about &lt;a href="https://blog.timescale.com/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=hyperfunctions-blog" rel="noopener noreferrer"&gt;how we use custom functions and aggregates to extend SQL&lt;/a&gt;, but we haven’t exactly explained the difference between them.&lt;/p&gt;

&lt;p&gt;The fundamental difference between an aggregate function and a “regular” function in SQL is that an &lt;strong&gt;aggregate&lt;/strong&gt; produces a single result from a group of related rows, while a regular &lt;strong&gt;function&lt;/strong&gt; produces a result for each row:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn89rz7olep6g62wxewx6.jpg" alt="A side-by-side diagram depicting an “aggregate” side and a “function” side and how each product results. There are three individual rows on the aggregate side, with arrows that point to a single result; on the function side, there are three individual rows, with arrows that point to three different results (one per row). " width="800" height="333"&gt;In SQL, aggregates produce a result from multiple rows, while functions produce a result per row.




&lt;p&gt;&lt;br&gt;
This is not to say that a function can’t have inputs from multiple columns; they just have to come from the same row. &lt;/p&gt;

&lt;p&gt;Another way to think about it is that functions often act on rows, whereas aggregates act on columns. To illustrate this, let’s consider a theoretical table &lt;code&gt;foo&lt;/code&gt; with two columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bar&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;baz&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And just a few values, so we can easily see what’s going on:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;baz&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The function &lt;a href="https://www.postgresql.org/docs/13/functions-conditional.html#FUNCTIONS-GREATEST-LEAST" rel="noopener noreferrer"&gt;&lt;code&gt;greatest()&lt;/code&gt;&lt;/a&gt; will produce the largest of the values in columns &lt;code&gt;bar&lt;/code&gt; and &lt;code&gt;baz&lt;/code&gt; for each row:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;greatest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;baz&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

 &lt;span class="n"&gt;greatest&lt;/span&gt; 
&lt;span class="c1"&gt;----------&lt;/span&gt;
        &lt;span class="mi"&gt;2&lt;/span&gt;
        &lt;span class="mi"&gt;4&lt;/span&gt;
        &lt;span class="mi"&gt;6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Whereas the aggregate &lt;a href="https://www.postgresql.org/docs/current/functions-aggregate.html" rel="noopener noreferrer"&gt;&lt;code&gt;max()&lt;/code&gt;&lt;/a&gt; will produce the largest value from each column:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bar_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baz&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;baz_max&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

 &lt;span class="n"&gt;bar_max&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;baz_max&lt;/span&gt; 
&lt;span class="c1"&gt;----|--------&lt;/span&gt;
       &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;       &lt;span class="mi"&gt;6&lt;/span&gt;
&lt;span class="nv"&gt;`&lt;/span&gt;&lt;span class="se"&gt;``&lt;/span&gt;&lt;span class="nv"&gt;



Using the above data, here’s a picture of what happens when we aggregate something: 
&amp;lt;figure&amp;gt;
&amp;lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hwomr9qpp8o3p461swg4.jpg" alt="A diagram showing how the statement: `&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="nv"&gt;` works: multiple rows with values of “bar equal to” 1.0, 2.0, and 3.0, go through the `&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nv"&gt;` aggregate to ultimately produce a result of 3.0. " style="width:100%"&amp;gt;
&amp;lt;figcaption align = "center"&amp;gt;The `&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nv"&gt;` aggregate gets the largest value from multiple rows.&amp;lt;/figcaption&amp;gt;
&amp;lt;/figure&amp;gt;
&amp;lt;p&amp;gt;
The aggregate takes inputs from multiple rows and produces a single result. That’s the main difference between it and a function, but how does it do that? Let’s look at what it’s doing under the hood.

&lt;h3&gt;
  
  
  Aggregate internals: row-by-row
&lt;/h3&gt;

&lt;p&gt;Under the hood, aggregates in PostgreSQL work row-by-row. But then, how does an aggregate know anything about the previous rows?&lt;/p&gt;

&lt;p&gt;Well, an aggregate stores some state about the rows it has previously seen, and as the database sees new rows, it updates that internal state.&lt;/p&gt;

&lt;p&gt;For the &lt;code&gt;max()&lt;/code&gt; aggregate we’ve been discussing, the internal state is simply the largest value we’ve seen so far.&lt;/p&gt;

&lt;p&gt;Let’s take this step-by-step.&lt;/p&gt;

&lt;p&gt;When we start, our internal state is &lt;code&gt;NULL&lt;/code&gt; because we haven’t seen any rows yet:&lt;br&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/edvpx1w1ga0qori6dwul.jpg" alt="Flowchart arrow diagram representing the max(bar) aggregate, with three rows below the arrow where bar is equal to 1.0, 2.0, and 3.0, respectively. There is a box in the arrow in which the state is equal to NULL." style="width:100%"&gt;&lt;/p&gt;

&lt;p&gt;Then, we get our first row in:&lt;br&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a7q3ejwkxieanvy4idem.jpg" alt="The same flowchart arrow diagram, except that row one, with bar equal to 1.0, has moved from below the arrow into the arrow." style="width:100%"&gt;&lt;/p&gt;

&lt;p&gt;Since our state is &lt;code&gt;NULL&lt;/code&gt;, we initialize it to the first value we see:&lt;br&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6izuxd45sevfvwjvh8ct.jpg" alt="The same flowchart diagram, except that row one has moved out of the arrow, and the state has been updated from NULL to 1.0, row one’s value." style="width:100%"&gt;&lt;/p&gt;

&lt;p&gt;Now, we get our second row:&lt;br&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/37jr6glndj13fdim3ehr.jpg" alt="The same flowchart diagram, except that row two has moved into the arrow representing the max aggregate." style="width:100%"&gt;&lt;/p&gt;

&lt;p&gt;And we see that the value of bar (2.0) is greater than our current state (1.0), so we update the state:&lt;br&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qvr3udolmos6v01wo59e.jpg" alt="The same diagram, except that row two has moved out of the max aggregate, and the state has been updated to the largest value (the value of row two, 2.0)." style="width:100%"&gt;&lt;/p&gt;

&lt;p&gt;Then, the next row comes into the aggregate:&lt;br&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7pjbniuy0xrdqsg5ufja.jpg" alt="The same diagram, except that row three has moved into the arrow representing the max aggregate." style="width:100%"&gt;&lt;/p&gt;

&lt;p&gt;We compare it to our current state, take the greatest value, and update our state:&lt;br&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kfkp9q9wf5gtey3k752b.jpg" alt="The same diagram, except that row three has moved out of the max aggregate, and the state has been updated to the largest value, the value of the third row, 3.0." style="width:100%"&gt;&lt;/p&gt;

&lt;p&gt;Finally, we don’t have any more rows to process, so we output our result:&lt;br&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ye5u1cdnqk3tslrpd225.jpg" alt="The same diagram, now noting that there are “no more rows” to process, and including a final result, 3.0, being output at the end of the arrow." style="width:100%"&gt;&lt;/p&gt;

&lt;p&gt;So, to summarize: each row comes in, gets compared to our current state, and then the state gets updated to reflect the new greatest value. Then the next row comes in, and we repeat the process until we’ve processed all our rows and output the result.&lt;/p&gt;


&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hbqf9uvopm9xq9skfwwd.gif" alt="A GIF depicting the previous diagrams, one after the other, as the rows move through the aggregate." style="width:100%"&gt;The &lt;code&gt;max()&lt;/code&gt; aggregation process, told in GIFs.


&lt;p&gt;There’s a name for the function that processes each row and updates the internal state: the &lt;strong&gt;&lt;a href="https://www.postgresql.org/docs/current/sql-createaggregate.html" rel="noopener noreferrer"&gt;state transition function&lt;/a&gt;&lt;/strong&gt; (or just “transition function” for short). The transition function for an aggregate takes the current state and the value from the incoming row as arguments and produces a new state.&lt;/p&gt;

&lt;p&gt;It’s defined like this, where &lt;code&gt;current_value&lt;/code&gt; represents values from the incoming row, &lt;code&gt;current_state&lt;/code&gt; represents the current aggregate state built up over the previous rows (or &lt;code&gt;NULL&lt;/code&gt; if we haven’t yet gotten any), and &lt;code&gt;next_state&lt;/code&gt; represents the output after analyzing the incoming row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;next_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transition_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Aggregate internals: composite state
&lt;/h3&gt;

&lt;p&gt;So, the &lt;code&gt;max()&lt;/code&gt; aggregate has a straightforward state that contains just one value (the largest we’ve seen). But not all aggregates in PostgreSQL have such a simple state.&lt;/p&gt;

&lt;p&gt;Let’s consider the aggregate for average (&lt;code&gt;avg()&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;To refresh, an average is defined as:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;avg(x)=sum(x)count(x)avg(x) = \frac{sum(x)}{count(x)}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;vg&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;co&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;m&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;

&lt;p&gt;To calculate it, we store the sum and the count as our internal state and update that state as we process rows:&lt;/p&gt;
&lt;figure&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pjuxskwee16306gnl48o.gif" alt="A GIF of the aggregation process for the statement SELECT avg(bar) FROM foo;, with diagrams similar to the previous. Three rows with values of bar equal to 1.0, 2.0, and 3.0 go through the aggregate, and the transition function updates the state, which has two values, each starting NULL; the sum is updated at each step by adding the value of the incoming row, and the count is incremented." style="width:100%"&gt;
&lt;figcaption align="center"&gt;The &lt;code&gt;avg()&lt;/code&gt; aggregation process, told in GIFs. For &lt;code&gt;avg()&lt;/code&gt;, the transition function must update a more complex state, since the sum and count are stored separately at each aggregation step.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;But, when we’re ready to output our result for &lt;code&gt;avg&lt;/code&gt;, we need to divide &lt;code&gt;sum&lt;/code&gt; by &lt;code&gt;count&lt;/code&gt;:&lt;/p&gt;
&lt;figure&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/to9t3tjqp6bj8rpol45m.jpg" alt="An arrow flowchart diagram similar to those before, showing the end state of the avg aggregate. The rows have moved through the aggregate, and the state is 6.0 - the sum - and three - the count. There are then some question marks and an end result of 2.0." style="width:100%"&gt;
&lt;figcaption align="center"&gt;For some aggregates, we can output the state directly – but for others, we need to perform an operation on the state before producing the final result.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;There’s another function inside the aggregate that performs this calculation: the &lt;strong&gt;&lt;a href="https://www.postgresql.org/docs/current/sql-createaggregate.html" rel="noopener noreferrer"&gt;final function&lt;/a&gt;&lt;/strong&gt;. Once we’ve processed all the rows, the final function takes the state and does whatever it needs to produce the result.&lt;/p&gt;

&lt;p&gt;It’s defined like this, where &lt;code&gt;final_state&lt;/code&gt; represents the output of the transition function after it has processed all the rows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;result = final_func(final_state)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And, through pictures:&lt;/p&gt;
&lt;figure&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f8ecxml9yt9lwxa28dih.gif" alt="A GIF that starts the same as the previous GIF: the avg aggregate state is updated as rows pass through the aggregate. Once all the rows are processed, a final function step divides the final sum - 6.0 - by the final count - 3 - and outputs the result - 2.0." style="width:100%"&gt;
&lt;figcaption align="center"&gt;How the average aggregate works, told in GIFs. Here, we’re highlighting the role of the final function.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;To summarize: as an aggregate scans over rows, its &lt;strong&gt;transition function&lt;/strong&gt; updates its internal state. Once the aggregate has scanned all of the rows, its &lt;strong&gt;final function&lt;/strong&gt; produces a result, which is returned to the user.&lt;/p&gt;
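&lt;p&gt;To make those two roles concrete, here’s a minimal sketch in Python. (The real PostgreSQL machinery is C internals; the names &lt;code&gt;transition_func&lt;/code&gt; and &lt;code&gt;final_func&lt;/code&gt; simply follow the pseudocode in this post.)&lt;/p&gt;

```python
# Toy version of avg()'s aggregate machinery: the state is a (sum, count)
# pair, the transition function folds one row into the state, and the final
# function turns the finished state into the result.

def transition_func(state, value):
    # PostgreSQL-style: a NULL (None) state means no rows seen yet
    if state is None:
        state = (0.0, 0)
    total, count = state
    return (total + value, count + 1)

def final_func(state):
    total, count = state
    return total / count

state = None
for row in [1.0, 2.0, 3.0]:      # the three values of bar from the example
    state = transition_func(state, row)

print(state)                      # (6.0, 3)
print(final_func(state))          # 2.0
```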
&lt;h3&gt;
  
  
  Improving the performance of aggregate functions
&lt;/h3&gt;

&lt;p&gt;One interesting thing to note here: the transition function is called many, many more times than the final function. It runs once for each row, whereas the final function runs once per group of rows. &lt;/p&gt;

&lt;p&gt;Now, the transition function isn’t inherently more expensive than the final function on a per-call basis – but because there are usually orders of magnitude more rows going into the aggregate than groups coming out, the transition function step quickly becomes the most expensive part. This is especially true when high-volume time-series data is ingested at high rates, so optimizing aggregate transition function calls is important for improving performance. &lt;/p&gt;

&lt;p&gt;Luckily, PostgreSQL already has ways to optimize aggregates.    &lt;/p&gt;
&lt;h3&gt;
  
  
  Parallelization and the combine function
&lt;/h3&gt;

&lt;p&gt;Because the transition function is run on each row, &lt;a href="https://www.postgresql.org/message-id/flat/CA%2BTgmoYSL_97a--qAvdOa7woYamPFknXsXX17m0t2Pwc%2BFOvYw%40mail.gmail.com#fb9f2ae2a52ac605a4439a1879ff3c10" rel="noopener noreferrer"&gt;some enterprising PostgreSQL developers&lt;/a&gt; asked: &lt;em&gt;what if we parallelized the transition function calculation?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s revisit our definitions for transition functions and final functions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;next_state = transition_func(current_state, current_value)
result = final_func(final_state)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can run this in parallel by instantiating multiple copies of the transition function and handing a subset of rows to each instance. Then, each parallel aggregate will run the transition function over the subset of rows it sees, producing multiple (partial) states, one for each parallel aggregate. But, since we need to aggregate over the entire data set, we can’t run the final function on each parallel aggregate separately because they only have some of the rows. &lt;/p&gt;

&lt;p&gt;So, now we’ve ended up in a bit of a pickle: we have multiple partial aggregate states, and the final function is only meant to work on the single, final state - right before we output the result to the user. &lt;/p&gt;

&lt;p&gt;To solve this problem, we need a new type of function that takes two partial states and combines them into one so that the final function can do its work. This is (aptly) called the &lt;strong&gt;&lt;a href="https://www.postgresql.org/docs/current/sql-createaggregate.html" rel="noopener noreferrer"&gt;combine function&lt;/a&gt;&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;We can run the combine function iteratively over all of the partial states that are created when we parallelize the aggregate.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;combined_state = combine_func(partial_state_1, partial_state_2)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For instance, in &lt;code&gt;avg&lt;/code&gt;, the combine function will add up the counts and sums.&lt;/p&gt;
&lt;figure&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ei54hrl4gy30hajmboys.gif" alt="A GIF that starts the same as the previous GIF: the avg aggregate state is updated as rows pass through the aggregate. Once all the rows are processed, a final function step divides the final sum - 6.0 - by the final count - 3 - and outputs the result - 2.0." style="width:100%"&gt;
&lt;figcaption align="center"&gt;How parallel aggregation works, told in GIFs. Here, we’re highlighting the combine function. (We’ve added a couple more rows to illustrate parallel aggregation.)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Then, after we have the combined state from all of our parallel aggregates, we run the final function and get our result.&lt;/p&gt;
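&lt;p&gt;Here’s a minimal Python sketch of partial states and a combine function (illustrative names only, not PostgreSQL’s actual C implementation):&lt;/p&gt;

```python
# Toy sketch of parallel aggregation for avg(): two "workers" each build a
# partial (sum, count) state, a combine function merges the partials, and the
# final function runs once on the merged state.

def transition_func(state, value):
    total, count = state
    return (total + value, count + 1)

def combine_func(s1, s2):
    # merge two partial states: sums add, counts add
    return (s1[0] + s2[0], s1[1] + s2[1])

def final_func(state):
    return state[0] / state[1]

rows = [1.0, 2.0, 3.0, 4.0]
p1 = (0.0, 0)
p2 = (0.0, 0)
for v in rows[:2]:               # worker 1 sees the first half of the rows
    p1 = transition_func(p1, v)
for v in rows[2:]:               # worker 2 sees the second half
    p2 = transition_func(p2, v)

merged = combine_func(p1, p2)    # (10.0, 4)
print(final_func(merged))        # 2.5
```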
&lt;h3&gt;
  
  
  Deduplication &lt;a name="deduplication"&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Parallelization and the combine function are one way to reduce the cost of calling an aggregate, but they’re not the only one. &lt;/p&gt;

&lt;p&gt;One other built-in PostgreSQL optimization that reduces an aggregate’s cost occurs in a statement like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT avg(bar), avg(bar) / 2 AS half_avg FROM foo;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;PostgreSQL will optimize this statement to evaluate the &lt;code&gt;avg(bar)&lt;/code&gt; calculation only once and then use that result twice. &lt;/p&gt;

&lt;p&gt;And, if we have different aggregates with the same transition function but different final functions? PostgreSQL further optimizes by calling the transition function (the expensive part) once over all the rows and then running each final function on the resulting state. Pretty neat!&lt;/p&gt;
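&lt;p&gt;As a hypothetical illustration in Python (not how PostgreSQL actually defines these particular built-in aggregates), here are two aggregates that share a transition function but differ in their final functions, so the expensive pass over the rows only happens once:&lt;/p&gt;

```python
# Sketch of the shared-transition optimization: an "avg" and a "sum" that
# both work off one (sum, count) state differ only in their final functions,
# so one transition pass can feed both results.

def transition_func(state, value):
    total, count = state
    return (total + value, count + 1)

def final_avg(state):
    return state[0] / state[1]

def final_sum(state):
    return state[0]

state = (0.0, 0)
for v in [1.0, 2.0, 3.0]:
    state = transition_func(state, v)   # the expensive part, run once

print(final_avg(state))   # 2.0
print(final_sum(state))   # 6.0
```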

&lt;p&gt;Now, that’s not all that PostgreSQL aggregates can do, but it’s a pretty good tour, and it’s enough to get us where we need to go today. &lt;/p&gt;


&lt;h2&gt;
  
  
  Two-step aggregation in TimescaleDB hyperfunctions&lt;a name="two-step-in-tsdb"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;In TimescaleDB, we’ve implemented the two-step aggregation design pattern for our aggregate functions. This generalizes the PostgreSQL internal aggregation API and exposes it to the user via our aggregates, accessors, and rollup functions. (In other words, each of the internal PostgreSQL functions has an equivalent function in TimescaleDB hyperfunctions.)&lt;/p&gt;

&lt;p&gt;As a refresher, when we talk about the two-step aggregation design pattern, we mean the following convention, where we have an inner aggregate call:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucyq3dmed7kf24n4rtfq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucyq3dmed7kf24n4rtfq.jpg" alt="The same as the previous in terms of code, except the sections: time_weight('LOCF', value) and percentile_agg(value) are highlighted " width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And an outer accessor call:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9x6h15xut38k0s3b670.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9x6h15xut38k0s3b670.jpg" alt="The same as the previous in terms of code, except the sections: average(time_weight('LOCF', value)) and approx_percentile(0.5, percentile_agg(value)) are highlighted" width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The inner aggregate call returns the internal state, just like the transition function does in PostgreSQL aggregates. &lt;/p&gt;

&lt;p&gt;The outer accessor call takes the internal state and returns a result to the user, just like the final function does in PostgreSQL. &lt;/p&gt;

&lt;p&gt;We also have special &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/rollup-percentile/#sample-usage" rel="noopener noreferrer"&gt;&lt;code&gt;rollup&lt;/code&gt;&lt;/a&gt; functions &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/rollup-timeweight/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=roll-up-timeweight-docs" rel="noopener noreferrer"&gt;defined for each of our aggregates&lt;/a&gt; that work much like PostgreSQL combine functions.&lt;/p&gt;
&lt;figure&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fgz5f9gw8cbanb0an8wg.jpg" alt="A table with columns labeled: the PostgreSQL internal aggregation API, Two-step aggregate equivalent, and TimescaleDB hyperfunction example. In the first row, we have the transition function, equivalent to the aggregate, and the examples are time_weight() and percentile_agg(). In the second row, we have the final function, equivalent to the accessor, and the examples are average() and approx_percentile(). In the third row, we have the combine function, equivalent to rollup in two-step aggregates, and the example is rollup()."&gt;
&lt;figcaption align="center"&gt;PostgreSQL internal aggregation APIs and their TimescaleDB hyperfunctions’ equivalents&lt;/figcaption&gt;
&lt;/figure&gt;


&lt;h2&gt;
  
  
  Why we use the two-step aggregate design pattern&lt;a name="3"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;There are four basic reasons we expose the two-step aggregate design pattern to users rather than leave it as an internal structure: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Allow multi-parameter aggregates to re-use state, making them more efficient&lt;/li&gt;
&lt;li&gt;Cleanly distinguish between parameters that affect aggregates vs. accessors, making performance implications easier to understand and predict&lt;/li&gt;
&lt;li&gt;Enable easy-to-understand rollups, with logically consistent results, in continuous aggregates and window functions (one of our most common requests for continuous aggregates)&lt;/li&gt;
&lt;li&gt;Allow easier retrospective analysis of downsampled data in continuous aggregates when requirements change but the raw data is already gone&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s a little theoretical, so let’s dive in and explain each one.&lt;/p&gt;
&lt;h3&gt;
  
  
  Re-using state
&lt;/h3&gt;

&lt;p&gt;PostgreSQL is very good at optimizing statements (as we saw earlier in this post, through pictures 🙌), but you have to give it things in a way it can understand. &lt;/p&gt;

&lt;p&gt;For instance, when we talked about deduplication, we saw that PostgreSQL could “figure out” when a statement occurs more than once in a query (i.e., &lt;code&gt;avg(bar)&lt;/code&gt;) and run it only a single time to avoid redundant work:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT avg(bar), avg(bar) / 2 AS half_avg FROM foo;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This works because &lt;code&gt;avg(bar)&lt;/code&gt; occurs multiple times without variation. &lt;/p&gt;

&lt;p&gt;However, if I write the equation in a slightly different way and move the division inside the parentheses so that the expression &lt;code&gt;avg(bar)&lt;/code&gt; doesn’t repeat so neatly, PostgreSQL can’t figure out how to optimize it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT avg(bar), avg(bar / 2) AS half_avg FROM foo;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It doesn’t know that dividing by a constant distributes over the average, or that those two queries are therefore equivalent. &lt;/p&gt;
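&lt;p&gt;The equivalence the planner can’t see is easy to check numerically; here’s a quick Python demonstration:&lt;/p&gt;

```python
# Numerically checking the equivalence PostgreSQL's planner doesn't exploit:
# averaging halved values gives the same result as halving the average,
# because dividing by a constant distributes over the mean.
bar = [1.0, 2.0, 3.0, 4.0]

avg = lambda xs: sum(xs) / len(xs)

print(avg([b / 2 for b in bar]))   # 1.25, like avg(bar / 2)
print(avg(bar) / 2)                # 1.25, like avg(bar) / 2
```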

&lt;p&gt;This is a complicated problem for database developers to solve, and thus, as a PostgreSQL user, you need to make sure to write your query in a way that the database can understand. &lt;/p&gt;

&lt;p&gt;Performance problems caused by statements that are equivalent, but that the database can’t recognize as equivalent (or that are equivalent in the specific case you wrote, but not in the general case), can be some of the trickiest SQL optimizations to figure out as a user. &lt;/p&gt;

&lt;p&gt;Therefore, &lt;strong&gt;when we design our APIs, we try to make it hard for users to unintentionally write low-performance code: in other words, the default option should be the high-performance option.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the next bit, it’ll be useful to have a simple table defined as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE TABLE foo(
    ts timestamptz,
    val DOUBLE PRECISION);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let’s look at an example of how we use two-step aggregation in the &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=percentile-approx-docs" rel="noopener noreferrer"&gt;percentile approximation hyperfunction&lt;/a&gt; to allow PostgreSQL to optimize performance.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT
    approx_percentile(0.1, percentile_agg(val)) as p10,
    approx_percentile(0.5, percentile_agg(val)) as p50,
    approx_percentile(0.9, percentile_agg(val)) as p90
FROM foo;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...is treated as the same as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT
    approx_percentile(0.1, pct_agg) as p10,
    approx_percentile(0.5, pct_agg) as p50,
    approx_percentile(0.9, pct_agg) as p90
FROM
(SELECT percentile_agg(val) as pct_agg FROM foo) pct;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This calling convention allows us to use identical aggregates so that, under the hood, PostgreSQL can deduplicate calls to the identical aggregates (and is faster as a result).&lt;/p&gt;

&lt;p&gt;Now, let’s compare this to the one-step aggregate approach. &lt;/p&gt;

&lt;p&gt;PostgreSQL can’t deduplicate aggregate calls here because the extra parameter in the &lt;code&gt;approx_percentile&lt;/code&gt; aggregate changes with each call:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- NB: THIS IS AN EXAMPLE OF AN API WE DECIDED NOT TO USE, IT DOES NOT WORK
SELECT
    approx_percentile(0.1, val) as p10,
    approx_percentile(0.5, val) as p50,
    approx_percentile(0.9, val) as p90
FROM foo;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So, even though all of those functions could use the same approximation built up over all the rows, PostgreSQL has no way of knowing that. The two-step aggregation approach structures our calls so that PostgreSQL can optimize them, and it lets developers predict which queries will be more expensive and which won't: multiple aggregates with different inputs will be expensive, whereas multiple accessors over the same aggregate will be much less expensive. &lt;/p&gt;
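&lt;p&gt;To make that concrete, here’s a toy sketch in Python (not TimescaleDB’s actual implementation; the function names and the sorted-list “state” are stand-ins) of how one expensive aggregate pass can feed several cheap accessor calls:&lt;/p&gt;

```python
# Hypothetical two-step aggregate: one expensive "aggregate" pass builds a
# reusable state, then several cheap "accessor" calls read from it.
def percentile_state(values):
    """Aggregate step: a single pass that builds a shared state (here, a sorted copy)."""
    return sorted(values)

def approx_percentile(p, state):
    """Accessor step: a cheap lookup against an already-built state."""
    idx = min(int(p * len(state)), len(state) - 1)
    return state[idx]

rows = [5, 1, 9, 3, 7, 2, 8, 4, 6, 0]
state = percentile_state(rows)  # the expensive work happens exactly once...
p10, p50, p90 = (approx_percentile(p, state) for p in (0.1, 0.5, 0.9))
# ...while each accessor reuses the same state instead of re-scanning the rows.
```

&lt;p&gt;A one-step API, by contrast, would force three independent passes over &lt;code&gt;rows&lt;/code&gt; here, one per percentile.&lt;/p&gt;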
&lt;h3&gt;
  
  
  Cleanly distinguishing between aggregate/accessor parameters
&lt;/h3&gt;

&lt;p&gt;We also chose the two-step aggregate approach because some of our aggregates can take multiple parameters or options themselves, and their accessors can also take options:&lt;/p&gt;

&lt;/span&gt;&lt;pre&gt;&lt;code&gt;SELECT
    approx_percentile(0.5, uddsketch(1000, 0.001, val)) as median, -- 1000 buckets, 0.001 target err
    approx_percentile(0.9, uddsketch(1000, 0.001, val)) as p90,
    approx_percentile(0.5, uddsketch(100, 0.01, val)) as less_accurate_median -- modify the parameters to the aggregate to get a new approximation
FROM foo;
&lt;/code&gt;&lt;/pre&gt;&lt;span class="nv"&gt;

&lt;p&gt;That’s an example of &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/uddsketch/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=uddsketch-docs" rel="noopener noreferrer"&gt;&lt;code&gt;uddsketch&lt;/code&gt;&lt;/a&gt;, an &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/#choosing-the-right-algorithm-for-your-use-case" rel="noopener noreferrer"&gt;advanced aggregation method&lt;/a&gt; for percentile approximation that can take its own parameters. &lt;/p&gt;

&lt;p&gt;Imagine if the parameters were jumbled together in one aggregate:&lt;/p&gt;

&lt;/span&gt;&lt;pre&gt;&lt;code&gt;-- NB: THIS IS AN EXAMPLE OF AN API WE DECIDED NOT TO USE, IT DOES NOT WORK
SELECT
    approx_percentile(0.5, 1000, 0.001, val) as median
FROM foo;
&lt;/code&gt;&lt;/pre&gt;&lt;span class="nv"&gt;

&lt;p&gt;It’d be pretty difficult to understand which argument is related to which part of the functionality.&lt;/p&gt;

&lt;p&gt;Conversely, the two-step approach separates the arguments to the accessor and to the aggregate very cleanly: the aggregate call, with its own parameters in its own parentheses, is nested inside the accessor’s argument list:&lt;/p&gt;

&lt;/span&gt;&lt;pre&gt;&lt;code&gt;SELECT
    approx_percentile(0.5, uddsketch(1000, 0.001, val)) as median
FROM foo;
&lt;/code&gt;&lt;/pre&gt;&lt;span class="nv"&gt;

&lt;p&gt;By making it clear which is which, users know that if they change the inputs to the aggregate, they will get more (costly) aggregate nodes, while inputs to the accessor are cheaper to change. &lt;/p&gt;

&lt;p&gt;So, those are the first two reasons we expose the API - and what it allows developers to do as a result. The last two reasons involve continuous aggregates and how they relate to hyperfunctions, so first, a quick refresher on what they are. &lt;/p&gt;


&lt;h2&gt;
  
  
  Two-step aggregation + continuous aggregates in TimescaleDB&lt;a name="4"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;TimescaleDB includes a feature called &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=cont-aggs-docs" rel="noopener noreferrer"&gt;continuous aggregates&lt;/a&gt;, which are designed to make queries on very large datasets run faster. TimescaleDB continuous aggregates continuously and incrementally store the results of an aggregation query in the background, so when you run the query, only the data that has changed needs to be computed, not the entire dataset. &lt;/p&gt;

&lt;p&gt;In our discussion of the combine function above, we covered how you could take the expensive work of computing the transition function over every row and split the rows over multiple parallel aggregates to speed up the calculation. &lt;/p&gt;

&lt;p&gt;TimescaleDB continuous aggregates do something similar, except they spread the computation work over time rather than between parallel processes running simultaneously. The continuous aggregate computes the transition function over a subset of rows inserted some time in the past, stores the result, and then, at query time, we only need to compute over the raw data for a small section of recent time that we haven’t yet calculated. &lt;/p&gt;
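&lt;p&gt;As a loose analogy (illustrative Python with a simple &lt;code&gt;(sum, count)&lt;/code&gt; state, not TimescaleDB’s internals), older buckets keep pre-computed partial states, and only the newest raw rows need the transition function at query time:&lt;/p&gt;

```python
# Toy model of spreading aggregation work over time: partial states for closed
# buckets were computed in the past; only recent raw rows need fresh work.
def transition(state, value):
    s, c = state
    return (s + value, c + 1)          # fold one raw row into a (sum, count) state

def combine(a, b):
    return (a[0] + b[0], a[1] + b[1])  # merge two partial states

stored = [(30, 3), (70, 4)]            # partial states materialized earlier
recent_rows = [10, 20]                 # only these rows need the transition function now

fresh = (0, 0)
for v in recent_rows:
    fresh = transition(fresh, v)

total_sum, total_count = fresh
for s in stored:
    total_sum, total_count = combine((total_sum, total_count), s)

average = total_sum / total_count      # same answer as scanning every raw row
```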

&lt;p&gt;When we designed TimescaleDB hyperfunctions, we wanted them to work well within continuous aggregates and even open new possibilities for users.  &lt;/p&gt;

&lt;p&gt;Let’s say I create a continuous aggregate from the simple table above to compute the sum, average, and percentile (the latter using a hyperfunction) in 15-minute increments:&lt;/p&gt;

&lt;/span&gt;&lt;pre&gt;&lt;code&gt;CREATE MATERIALIZED VIEW foo_15_min_agg
WITH (timescaledb.continuous)
AS SELECT id,
    time_bucket('15 min'::interval, ts) as bucket,
    sum(val),
    avg(val),
    percentile_agg(val)
FROM foo
GROUP BY id, time_bucket('15 min'::interval, ts);
&lt;/code&gt;&lt;/pre&gt;&lt;span class="nv"&gt;

&lt;p&gt;And then what if I come back and I want to re-aggregate it to hours or days, rather than 15-minute buckets – or need to aggregate my data across all ids? Which aggregates can I do that for, and which can’t I? &lt;/p&gt;
&lt;h3&gt;
  
  
  Logically consistent rollups
&lt;/h3&gt;

&lt;p&gt;One of the problems we wanted to solve with two-step aggregation was how to convey to the user when it is “okay” to re-aggregate and when it’s not. (By “okay,” I mean you would get the same result from the re-aggregated data as you would running the aggregate on the raw data directly.) &lt;/p&gt;

&lt;p&gt;For instance:&lt;/p&gt;

&lt;/span&gt;&lt;pre&gt;&lt;code&gt;SELECT sum(val) FROM tab;
-- is equivalent to:
SELECT sum(sum)
FROM
    (SELECT id, sum(val)
    FROM tab
    GROUP BY id) s;
&lt;/code&gt;&lt;/pre&gt;&lt;span class="nv"&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;/span&gt;&lt;pre&gt;&lt;code&gt;SELECT avg(val) FROM tab;
-- is NOT equivalent to:
SELECT avg(avg)
FROM
    (SELECT id, avg(val)
    FROM tab
    GROUP BY id) s;
&lt;/code&gt;&lt;/pre&gt;&lt;span class="nv"&gt;

&lt;p&gt;Why is re-aggregation okay for &lt;code&gt;sum&lt;/code&gt; but not for &lt;code&gt;avg&lt;/code&gt;? &lt;/p&gt;

&lt;p&gt;Technically, it’s logically consistent to re-aggregate when: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The aggregate returns the internal aggregate state. The internal aggregate state for sum is &lt;code&gt;(sum)&lt;/code&gt;, whereas for average, it is &lt;code&gt;(sum, count)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The aggregate’s combine and transition functions are equivalent. For &lt;code&gt;sum()&lt;/code&gt;, the states and the operations are the same: its transition function adds the incoming value to the state, and its combine function adds two states together, or a sum of sums. For &lt;code&gt;count()&lt;/code&gt;, the states are the same, but the transition and combine functions perform &lt;em&gt;different operations&lt;/em&gt; on them: its transition function increments the state for each incoming value, while its combine function adds two states together, or a sum of counts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But, you have to have in-depth (and sometimes rather arcane) knowledge about each aggregate’s internals to know which ones meet the above criteria – and therefore, which ones you can re-aggregate.&lt;/p&gt;
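&lt;p&gt;You can see the pitfall in a few lines of plain Python (the groups and values here are made up for illustration):&lt;/p&gt;

```python
# Why sum re-aggregates cleanly but avg does not.
groups = {"a": [1, 2, 3], "b": [10, 20]}
all_vals = [v for vs in groups.values() for v in vs]

# sum: the output IS the internal state, so a sum of sums is still correct.
assert sum(sum(vs) for vs in groups.values()) == sum(all_vals)

# avg: the output discards the counts, so an avg of avgs weights groups equally.
avg = lambda xs: sum(xs) / len(xs)
avg_of_avgs = avg([avg(vs) for vs in groups.values()])  # (2 + 15) / 2 = 8.5
true_avg = avg(all_vals)                                # 36 / 5 = 7.2

# Carrying the full (sum, count) state instead makes re-aggregation exact.
states = [(sum(vs), len(vs)) for vs in groups.values()]
merged = (sum(s for s, _ in states), sum(c for _, c in states))
assert merged[0] / merged[1] == true_avg
```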

&lt;p&gt;&lt;strong&gt;With the two-step aggregate approach, we can convey when it is logically consistent to re-aggregate by exposing our equivalent of the combine function when the aggregate allows it.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;We call that function &lt;code&gt;rollup()&lt;/code&gt;. &lt;code&gt;rollup()&lt;/code&gt; takes multiple inputs from the aggregate and combines them into a single value. &lt;/p&gt;

&lt;p&gt;All of our aggregates that can be combined have &lt;code&gt;rollup&lt;/code&gt; functions that combine the output of the aggregate from two different groups of rows. (Technically, &lt;code&gt;rollup()&lt;/code&gt; is an aggregate function because it acts on multiple rows; for clarity, I’ll call them rollup functions to distinguish them from the base aggregate.) Then you can call the accessor on the combined output! &lt;/p&gt;
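&lt;p&gt;Here’s a hypothetical, greatly simplified stand-in for that pattern in Python, using an exact &lt;code&gt;Counter&lt;/code&gt; of values as the “state” (the real sketches are far more compact): merge the states from two groups of rows, then run the accessor on the merged state.&lt;/p&gt;

```python
# Toy two-step aggregate with a rollup: merge per-group states, then read
# a percentile off the merged state. Names mirror the SQL API for clarity
# only; this is not the real implementation.
from collections import Counter

def percentile_agg(values):
    return Counter(values)             # toy "state": exact value counts

def rollup(states):
    merged = Counter()
    for s in states:
        merged.update(s)               # combining states is just merging counts
    return merged

def approx_percentile(p, state):
    ordered = sorted(state.elements())
    return ordered[min(int(p * len(ordered)), len(ordered) - 1)]

bucket1 = percentile_agg([1, 2, 3, 4, 5])
bucket2 = percentile_agg([6, 7, 8, 9, 10])
median = approx_percentile(0.5, rollup([bucket1, bucket2]))
# Same result as aggregating all ten raw values in one pass.
```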

&lt;p&gt;So, using the continuous aggregate we created, getting a 1-day re-aggregation of our &lt;code&gt;percentile_agg&lt;/code&gt; becomes as simple as:&lt;/p&gt;

&lt;/span&gt;&lt;pre&gt;&lt;code&gt;SELECT id,
    time_bucket('1 day'::interval, bucket) as bucket,
    approx_percentile(0.5, rollup(percentile_agg)) as median
FROM foo_15_min_agg
GROUP BY id, time_bucket('1 day'::interval, bucket);
&lt;/code&gt;&lt;/pre&gt;&lt;span class="nv"&gt;

&lt;p&gt;(We actually suggest that you create your continuous aggregates without calling the accessor function, for this very reason. Then, you can just create views on top or put the accessor call in your query.)&lt;/p&gt;

&lt;p&gt;This brings us to our final reason.&lt;/p&gt;
&lt;h3&gt;
  
  
  Retrospective analysis using continuous aggregates
&lt;/h3&gt;

&lt;p&gt;When we create a continuous aggregate, we’re defining a view of our data that we could then be stuck with for a very long time. &lt;/p&gt;

&lt;p&gt;For example, we might have a data retention policy that deletes the underlying data after X time period. If we want to go back and re-calculate anything, it can be challenging, if not impossible, since we’ve “dropped” the data. &lt;/p&gt;

&lt;p&gt;But, we understand that in the real world, you don’t always know what you’re going to need to analyze ahead of time. &lt;/p&gt;

&lt;p&gt;Thus, we designed hyperfunctions to use the two-step aggregate approach so they would integrate well with continuous aggregates. As a result, users can store the aggregate state in the continuous aggregate view and change or add accessor functions later, without recalculating old states that might be difficult (or impossible) to reconstruct (because the data is archived, deleted, etc.). &lt;/p&gt;

&lt;p&gt;The two-step aggregation design also allows for much greater flexibility with continuous aggregates. For instance, let’s take a continuous aggregate where we do the aggregate part of the two-step aggregation like this:&lt;/p&gt;

&lt;/span&gt;&lt;pre&gt;&lt;code&gt;CREATE MATERIALIZED VIEW foo_15_min_agg
WITH (timescaledb.continuous)
AS SELECT id,
    time_bucket('15 min'::interval, ts) as bucket,
    percentile_agg(val)
FROM foo
GROUP BY id, time_bucket('15 min'::interval, ts);
&lt;/code&gt;&lt;/pre&gt;&lt;span class="nv"&gt;

&lt;p&gt;When we first create the aggregate, we might only want to get the median:&lt;/p&gt;

&lt;/span&gt;&lt;pre&gt;&lt;code&gt;SELECT
    approx_percentile(0.5, percentile_agg) as median
FROM foo_15_min_agg;
&lt;/code&gt;&lt;/pre&gt;&lt;span class="nv"&gt;

&lt;p&gt;But then, later, we decide we want to know the 95th percentile as well. &lt;/p&gt;

&lt;p&gt;Luckily, we don’t have to modify the continuous aggregate; we &lt;strong&gt;just modify the parameters to the accessor function in our original query to return the data we want from the aggregate state:&lt;/strong&gt;&lt;/p&gt;

&lt;/span&gt;&lt;span class="se"&gt;&lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;`&amp;lt;/span&amp;gt;&amp;lt;span class="k"&amp;gt;SQL&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="k"&amp;gt;SELECT&amp;lt;/span&amp;gt;&lt;br&gt;
    &amp;lt;span class="n"&amp;gt;approx_percentile&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;(&amp;lt;/span&amp;gt;&amp;lt;span class="mi"&amp;gt;0&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="mi"&amp;gt;5&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;percentile_agg&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;)&amp;lt;/span&amp;gt; &amp;lt;span class="k"&amp;gt;as&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;median&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt;&lt;br&gt;
    &amp;lt;span class="n"&amp;gt;approx_percentile&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;(&amp;lt;/span&amp;gt;&amp;lt;span class="mi"&amp;gt;0&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="mi"&amp;gt;95&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;percentile_agg&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;)&amp;lt;/span&amp;gt; &amp;lt;span class="k"&amp;gt;as&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;p95&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="k"&amp;gt;FROM&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;foo_15_min_agg&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;;&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="nv"&amp;gt;`&amp;lt;/span&amp;gt;&amp;lt;span class="se"&amp;gt;&lt;/code&gt;&lt;/span&gt;&lt;span class="nv"&gt;

&lt;p&gt;And if, a year later, we decide we want the 99th percentile as well, we can get that too:&lt;/p&gt;

&lt;/span&gt;&lt;span class="se"&gt;&lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;`&amp;lt;/span&amp;gt;&amp;lt;span class="k"&amp;gt;SQL&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="k"&amp;gt;SELECT&amp;lt;/span&amp;gt;&lt;br&gt;
    &amp;lt;span class="n"&amp;gt;approx_percentile&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;(&amp;lt;/span&amp;gt;&amp;lt;span class="mi"&amp;gt;0&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="mi"&amp;gt;5&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;percentile_agg&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;)&amp;lt;/span&amp;gt; &amp;lt;span class="k"&amp;gt;as&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;median&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt;&lt;br&gt;
    &amp;lt;span class="n"&amp;gt;approx_percentile&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;(&amp;lt;/span&amp;gt;&amp;lt;span class="mi"&amp;gt;0&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="mi"&amp;gt;95&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;percentile_agg&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;)&amp;lt;/span&amp;gt; &amp;lt;span class="k"&amp;gt;as&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;p95&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt;&lt;br&gt;
    &amp;lt;span class="n"&amp;gt;approx_percentile&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;(&amp;lt;/span&amp;gt;&amp;lt;span class="mi"&amp;gt;0&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="mi"&amp;gt;99&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;percentile_agg&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;)&amp;lt;/span&amp;gt; &amp;lt;span class="k"&amp;gt;as&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;p99&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="k"&amp;gt;FROM&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;foo_15_min_agg&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;;&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="nv"&amp;gt;`&amp;lt;/span&amp;gt;&amp;lt;span class="se"&amp;gt;&lt;/code&gt;&lt;/span&gt;&lt;span class="nv"&gt;

&lt;p&gt;That’s just scratching the surface. Ultimately, our goal is to provide a high level of developer productivity that enhances other PostgreSQL and TimescaleDB features, like aggregate deduplication and continuous aggregates. &lt;/p&gt;


&lt;h2&gt;
  
  
  An example of how the two-step aggregate design impacts hyperfunctions’ code&amp;lt;a name="5"&amp;gt;&amp;lt;/a&amp;gt;
&lt;/h2&gt;

&lt;p&gt;To illustrate how the two-step aggregate design pattern impacts how we think about and code hyperfunctions, let’s look at the &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=time-weight-avgs-docs" rel="noopener noreferrer"&gt;time-weighted average family of functions&lt;/a&gt;. (Our &lt;a href="https://blog.timescale.com/blog/what-time-weighted-averages-are-and-why-you-should-care/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=time-weight-avgs-blogpost" rel="noopener noreferrer"&gt;what time-weighted averages are and why you should care&lt;/a&gt; post provides a lot of context for this next bit, so if you haven’t read it, we recommend doing so. You can also skip this next bit for now.)&lt;/p&gt;

&lt;p&gt;The equation for the time-weighted average is as follows:&lt;/p&gt;

&lt;div class="katex-element"&gt;
  &lt;p&gt;&lt;code&gt;time_weighted_average = area_under_curve / ΔT&lt;/code&gt;&lt;/p&gt;
&lt;/div&gt;
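To make the formula concrete, here's a small sketch (not from the post; it assumes linear interpolation between samples, so each segment's contribution is a trapezoid — a last-observation-carried-forward weighting would use rectangles instead):

```python
# Hypothetical sketch of the time-weighted average formula above,
# assuming linear interpolation between (time, value) samples.

def time_weighted_average(points):
    """points: list of (time, value) pairs, sorted by time."""
    # area_under_curve: sum of trapezoid areas between consecutive points
    area_under_curve = sum(
        (v0 + v1) / 2 * (t1 - t0)
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    )
    delta_t = points[-1][0] - points[0][0]  # ΔT
    return area_under_curve / delta_t

# Irregularly spaced samples: the long-lived value dominates the result
print(time_weighted_average([(0, 10), (1, 100), (10, 100)]))  # → 95.5
```

Note how the value 100, which was in effect for 9 of the 10 seconds, pulls the result to 95.5, whereas a naive average of the three samples would give 70.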

&lt;p&gt;As we noted in the table above: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;time_weight&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; is TimescaleDB hyperfunctions’ aggregate and corresponds to the transition function in PostgreSQL’s internal API.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;average&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; is the accessor, which corresponds to the PostgreSQL final function.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="k"&amp;gt;rollup&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; for re-aggregation corresponds to the PostgreSQL combine function. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;time_weight&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; function returns an aggregate type that has to be usable by the other functions in the family.&lt;/p&gt;

&lt;p&gt;In this case, we decided on a &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; type that is defined like so (in pseudocode):&lt;/p&gt;

&lt;/span&gt;&lt;span class="se"&gt;&lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;`&amp;lt;/span&amp;gt;&amp;lt;span class="k"&amp;gt;SQL&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt; &amp;lt;span class="o"&amp;gt;=&amp;lt;/span&amp;gt; &amp;lt;span class="p"&amp;gt;(&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;w_sum&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;first_pt&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;last_pt&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;)&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="nv"&amp;gt;`&amp;lt;/span&amp;gt;&amp;lt;span class="se"&amp;gt;&lt;/code&gt;&lt;/span&gt;&lt;span class="nv"&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;w_sum&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; is the weighted sum (another name for the area under the curve), and &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;first_pt&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;last_pt&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; are the first and last (time, value) pairs in the rows that feed into the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;time_weight&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; aggregate. &lt;/p&gt;

&lt;p&gt;Here’s a graphic depiction of those elements, which builds on our &lt;a href="https://blog.timescale.com/blog/what-time-weighted-averages-are-and-why-you-should-care/#mathy-bits-how-to-derive-a-time-weighted-average" rel="noopener noreferrer"&gt;how to derive a time-weighted average theoretical description&lt;/a&gt;:&lt;br&gt;
&amp;lt;figure&amp;gt;&lt;br&gt;
&amp;lt;img src="&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yt3d7d06r1fajo7iiued.jpg" rel="noopener noreferrer"&gt;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yt3d7d06r1fajo7iiued.jpg&lt;/a&gt;" alt="A graph showing value on the y-axis and time on the x-axis. There are four points:  open parens t 1 comma v 1 close parens, labeled first point to open parens t 4 comma  v 4 close parens, labeled last point. The points are spaced unevenly in time on the graph. The area under the graph is shaded, and labeled w underscore sum. The time axis has a brace describing the total distance between the first and last points labeled Delta T. "&amp;gt;&lt;br&gt;
&amp;lt;figcaption align = "center"&amp;gt;Depiction of the values we store in the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; representation. &amp;lt;/figcaption&amp;gt;&lt;br&gt;
&amp;lt;/figure&amp;gt;&lt;br&gt;
&amp;lt;p&amp;gt;&lt;br&gt;
So, the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;time_weight&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; aggregate does all of the calculations as it receives each of the points in our graph and builds a weighted sum for the time period (ΔT) between the first and last points it “sees.” It then outputs the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;average&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; accessor function performs simple calculations to return the time-weighted average from the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt;  (in pseudocode where &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;pt&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="nb"&amp;gt;time&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; returns the time from the point):&lt;/p&gt;

&lt;/span&gt;&lt;span class="se"&gt;&lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;`&amp;lt;/span&amp;gt;&amp;lt;span class="k"&amp;gt;SQL&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="n"&amp;gt;func&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;average&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;(&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;tws&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;)&amp;lt;/span&amp;gt; &lt;br&gt;
&amp;lt;span class="o"&amp;gt;-&amp;amp;gt;&amp;lt;/span&amp;gt; &amp;lt;span class="nb"&amp;gt;float&amp;lt;/span&amp;gt; &amp;lt;span class="p"&amp;gt;{&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="n"&amp;gt;delta_t&amp;lt;/span&amp;gt; &amp;lt;span class="o"&amp;gt;=&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;tws&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;last_pt&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="nb"&amp;gt;time&amp;lt;/span&amp;gt; &amp;lt;span class="o"&amp;gt;-&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;tws&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;first_pt&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="nb"&amp;gt;time&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;;&amp;lt;/span&amp;gt;&lt;br&gt;
    &amp;lt;span class="n"&amp;gt;time_weighted_average&amp;lt;/span&amp;gt; &amp;lt;span class="o"&amp;gt;=&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;tws&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;w_sum&amp;lt;/span&amp;gt; &amp;lt;span class="o"&amp;gt;/&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;delta_t&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;;&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="k"&amp;gt;return&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;time_weighted_average&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;;&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="p"&amp;gt;}&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="nv"&amp;gt;`&amp;lt;/span&amp;gt;&amp;lt;span class="se"&amp;gt;&lt;/code&gt;&lt;/span&gt;&lt;span class="nv"&gt;

&lt;p&gt;But, as we built the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;time_weight&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; hyperfunction, ensuring the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="k"&amp;gt;rollup&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; function worked as expected was a little more difficult – and introduced constraints that impacted the design of our &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; data type. &lt;/p&gt;

&lt;p&gt;To understand the rollup function, let’s use our graphical example and imagine the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;time_weight&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; function returns two &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummaries&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; from different regions of time like so: &lt;br&gt;
&amp;lt;figure&amp;gt;&lt;br&gt;
&amp;lt;img src="&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wdesd8okhaobc9lk1imk.jpg" rel="noopener noreferrer"&gt;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wdesd8okhaobc9lk1imk.jpg&lt;/a&gt;" alt="A similar graph to the previous, except that now there are two sets of shaded regions. The first is similar to the previous and is labeled with first sub 1 open parens t 1 comma v 1 close parens, last  1 open parens t 4 comma  v 4 close parens , and w underscore sum  1.  The second is similar, with points first 2 open parens t 5 comma  v 4 close parens and last 2 open parens t 8 comma  v 8 close parens and the label w underscore sum 2 on the shaded portion."&amp;gt;&lt;br&gt;
&amp;lt;figcaption align = "center"&amp;gt;What happens when we have multiple &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummaries&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; representing different regions of the graph. &amp;lt;/figcaption&amp;gt;&lt;br&gt;
&amp;lt;/figure&amp;gt;&lt;br&gt;
&amp;lt;p&amp;gt;&lt;br&gt;
The &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="k"&amp;gt;rollup&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; function needs to take in and return the same &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; data type so that our &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;average&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; accessor can understand it. (This mirrors how PostgreSQL’s combine function takes in two states from the transition function and then returns a single state for the final function to process). &lt;/p&gt;

&lt;p&gt;We also want the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="k"&amp;gt;rollup&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; output to be the same as if we had computed the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;time_weight&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; over all the underlying data. The output should be a &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; representing the full region.  &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; we output should also account for the area in the gap between these two weighted sum states:&lt;br&gt;
&amp;lt;figure&amp;gt;&lt;br&gt;
&amp;lt;img src="&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/en9qxli39gr6khvcopkl.jpg" rel="noopener noreferrer"&gt;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/en9qxli39gr6khvcopkl.jpg&lt;/a&gt;" alt="A similar picture to the previous, with the area between the points open parens t 4 comma  v 4 close parens aka last 1 and open parens t 5 comma  v 5 close parens aka first 2, down to the time axis highlighted. This is called w underscore sum gap. "&amp;gt;&lt;br&gt;
&amp;lt;figcaption align = "center"&amp;gt;Mind the gap! (between one &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; and the next).&amp;lt;/figcaption&amp;gt;&lt;br&gt;
&amp;lt;/figure&amp;gt;&lt;br&gt;
&amp;lt;p&amp;gt;&lt;br&gt;
The gap area is easy to get because we have the last_1 and first_2 points, and it’s the same as the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;w_sum&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; we’d get by running the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;time_weight&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; aggregate on them.&lt;/p&gt;

&lt;p&gt;Thus, the overall &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="k"&amp;gt;rollup&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; function needs to do something like this (where &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;w_sum&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; extracts the weighted sum from the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt;):&lt;/p&gt;

&lt;/span&gt;&lt;span class="se"&gt;&lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;`&amp;lt;/span&amp;gt;&amp;lt;span class="k"&amp;gt;SQL&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="n"&amp;gt;func&amp;lt;/span&amp;gt; &amp;lt;span class="k"&amp;gt;rollup&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;(&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;tws1&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;tws2&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;)&amp;lt;/span&amp;gt; &lt;br&gt;
&amp;lt;span class="o"&amp;gt;-&amp;amp;gt;&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt; &amp;lt;span class="p"&amp;gt;{&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="n"&amp;gt;w_sum_gap&amp;lt;/span&amp;gt; &amp;lt;span class="o"&amp;gt;=&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;time_weight&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;(&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;tws1&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;last_pt&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;tws2&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;first_pt&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;).&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;w_sum&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;;&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="n"&amp;gt;w_sum_total&amp;lt;/span&amp;gt; &amp;lt;span class="o"&amp;gt;=&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;w_sum_gap&amp;lt;/span&amp;gt; &amp;lt;span class="o"&amp;gt;+&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;tws1&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;w_sum&amp;lt;/span&amp;gt; &amp;lt;span class="o"&amp;gt;+&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;tws2&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;w_sum&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;;&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="k"&amp;gt;return&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;(&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;w_sum_total&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;tws1&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;first_pt&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;tws2&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;last_pt&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;);&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="p"&amp;gt;}&amp;lt;/span&amp;gt;&lt;br&gt;
&amp;lt;span class="nv"&amp;gt;`&amp;lt;/span&amp;gt;&amp;lt;span class="se"&amp;gt;&lt;/code&gt;&lt;/span&gt;&lt;span class="nv"&gt;
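In the same spirit as the pseudocode above, here's a hedged Python sketch of how such a combine step could work; the names (`TimeWeightSummary`, `w_sum`, `first_pt`, `last_pt`) mirror the post's pseudocode, and the gap area assumes linear interpolation — this illustrates the idea, not the toolkit's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class TimeWeightSummary:
    w_sum: float     # weighted sum (area under the curve)
    first_pt: tuple  # (time, value) of the first point seen
    last_pt: tuple   # (time, value) of the last point seen

def rollup(tws1, tws2):
    """Combine two summaries over adjacent time regions (tws1 earlier)."""
    # Area of the gap between tws1's last point and tws2's first point,
    # assuming linear interpolation across the gap (one extra trapezoid)
    (t0, v0), (t1, v1) = tws1.last_pt, tws2.first_pt
    w_sum_gap = (v0 + v1) / 2 * (t1 - t0)
    # The combined summary spans the earliest to the latest point
    return TimeWeightSummary(
        tws1.w_sum + w_sum_gap + tws2.w_sum,
        tws1.first_pt,
        tws2.last_pt,
    )

# Two 10-second regions separated by a 10-second gap
a = TimeWeightSummary(50.0, (0, 0), (10, 10))     # ramp from 0 up to 10
b = TimeWeightSummary(100.0, (20, 10), (30, 10))  # flat at 10
combined = rollup(a, b)
print(combined.w_sum)  # 50 + 100 (gap, flat at 10) + 100 = 250.0
```

Because `rollup()` returns the same `TimeWeightSummary` type it consumes, the result can be fed straight into the `average()` accessor or rolled up again, which is exactly the property the combine step needs.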

&lt;p&gt;Graphically, that means we’d end up with a single &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; representing the whole area:&lt;br&gt;
&amp;lt;figure&amp;gt;&lt;br&gt;
&amp;lt;img src="&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/czw7y3lgsot8q3sppaa2.jpg" rel="noopener noreferrer"&gt;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/czw7y3lgsot8q3sppaa2.jpg&lt;/a&gt;" alt="Similar to the previous graphs, except that now there is only one region that has been shaded, the combined area of the w underscore sum 1, w underscore sum 2, and w underscore sum gap has become one area, w underscore sum. Only the overall first open parens t 1 comma  v 1 close parens and last open parens t 8 comma  v 8 close parens points are shown."&amp;gt;&lt;br&gt;
&amp;lt;figcaption align = "center"&amp;gt;The combined &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;TimeWeightSummary&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt;. &amp;lt;/figcaption&amp;gt;&lt;br&gt;
&amp;lt;/figure&amp;gt;&lt;br&gt;
&amp;lt;p&amp;gt;&lt;br&gt;
So that’s how the two-step aggregate design approach ends up affecting the real-world implementation of our time-weighted average hyperfunctions. The above explanations are a bit condensed, but they should give you a more concrete look at how the &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;time_weight&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; aggregate, &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;average&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; accessor, and &lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="k"&amp;gt;rollup&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt; functions work.&lt;/p&gt;


&lt;h2&gt;
  
  
  Summing it up&amp;lt;a name="6"&amp;gt;&amp;lt;/a&amp;gt;
&lt;/h2&gt;

&lt;p&gt;Now that you’ve gotten a tour of the PostgreSQL aggregate API, how it inspired the TimescaleDB hyperfunctions’ two-step aggregate API, and a few examples of how this works in practice, we hope you'll try it out yourself and tell us what you think :). &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you’d like to get started with hyperfunctions right away, &lt;a href="https://console.forge.timescale.com/signup/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=sign-up" rel="noopener noreferrer"&gt;spin up a fully managed TimescaleDB service and try it for free&lt;/a&gt;&lt;/strong&gt;. Hyperfunctions are pre-loaded on each new database service on Timescale Forge, so after you create a new service, you’re all set to use them!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you prefer to manage your own database instances, you can &lt;a href="https://github.com/timescale/timescaledb-toolkit" rel="noopener noreferrer"&gt;download and install the timescaledb_toolkit extension&lt;/a&gt;&lt;/strong&gt; on GitHub, after which you’ll be able to use &lt;code&gt;time_weight&lt;/code&gt; and all other hyperfunctions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you have questions or comments on this blog post, &lt;a href="https://github.com/timescale/timescaledb-toolkit/discussions/196" rel="noopener noreferrer"&gt;we’ve started a discussion on our GitHub page, and we’d love to hear from you&lt;/a&gt;&lt;/strong&gt;. (And, if you like what you see, GitHub ⭐ are always welcome and appreciated too!)&lt;/p&gt;

&lt;p&gt;We love building in public, and you can view our &lt;a href="https://github.com/timescale/timescaledb-toolkit" rel="noopener noreferrer"&gt;upcoming roadmap on GitHub&lt;/a&gt; for a list of proposed features, features we’re currently implementing, and features available to use today. For reference, the two-step aggregate approach isn’t just used in the stabilized hyperfunctions covered here; it’s also used in many of our experimental features, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/stats_agg.md" rel="noopener noreferrer"&gt;&lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;stats_agg&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt;&lt;/a&gt; uses two-step aggregation to make simple statistical aggregates, like average and standard deviation, easier to work with in continuous aggregates and to &lt;a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/rolling_average_api_working.md" rel="noopener noreferrer"&gt;simplify computing rolling averages&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/counter_agg.md" rel="noopener noreferrer"&gt;&lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;counter_agg&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;()&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt;&lt;/a&gt; uses two-step aggregation to make working with counters more efficient and composable.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/timescale/timescaledb-toolkit/blob/main/docs/hyperloglog.md" rel="noopener noreferrer"&gt;&lt;code&gt;&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;Hyperloglog&amp;lt;/span&amp;gt;&amp;lt;span class="nv"&amp;gt;&lt;/code&gt;&lt;/a&gt; uses two-step aggregation in conjunction with continuous aggregates to give users faster approximate COUNT DISTINCT rollups over longer periods of time. &lt;/li&gt;
&lt;/ul&gt;
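&lt;p&gt;The reason all of these compose the same way is that each one keeps a small, mergeable partial state. As a rough illustration of the idea behind &lt;code&gt;stats_agg()&lt;/code&gt; (the helper names here are invented for the sketch, not the toolkit’s API), a (count, sum, sum of squares) triple is enough to merge averages and variances exactly across time buckets:&lt;/p&gt;

```python
# Partial state: (count, sum, sum of squares). Two partials from
# different buckets can be merged exactly, which is what makes
# two-step aggregates work inside continuous aggregates.
def stats_partial(values):
    return (len(values), sum(values), sum(v * v for v in values))

def combine(a, b):
    # Element-wise addition merges the two partial states
    return tuple(x + y for x, y in zip(a, b))

def mean(p):
    n, s, _ = p
    return s / n

def variance(p):
    # Population variance recovered from the merged partial state
    n, s, ss = p
    return ss / n - (s / n) ** 2

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
merged = combine(stats_partial(data[:3]), stats_partial(data[3:]))
assert mean(merged) == 3.5
```

&lt;p&gt;Because the merge is exact, rolling a day of partials up into a month never re-reads the raw rows.&lt;/p&gt;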

&lt;p&gt;These features will be stabilized soon, but we’d love your feedback while the APIs are still evolving. What would make them more intuitive? Easier to use? &lt;a href="https://github.com/timescale/timescaledb-toolkit/issues" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt; or &lt;a href="https://github.com/timescale/timescaledb-toolkit/discussions" rel="noopener noreferrer"&gt;start a discussion&lt;/a&gt;!&lt;br&gt;
&lt;/p&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>opensource</category>
      <category>timeseries</category>
    </item>
    <item>
      <title>What time-weighted averages are and why you should care</title>
      <dc:creator>davidkohn88</dc:creator>
      <pubDate>Thu, 29 Jul 2021 12:41:56 +0000</pubDate>
      <link>https://dev.to/tigerdata/what-time-weighted-averages-are-and-why-you-should-care-1ge6</link>
      <guid>https://dev.to/tigerdata/what-time-weighted-averages-are-and-why-you-should-care-1ge6</guid>
      <description>&lt;p&gt;&lt;em&gt;Learn how time-weighted averages are calculated, why they’re so powerful for data analysis, and how to use TimescaleDB hyperfunctions to calculate them faster – all using SQL.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Many people who work with time-series data have nice, regularly sampled datasets. Data could be sampled every few seconds, or milliseconds, or whatever they choose, but by regularly sampled, we mean the time between data points is basically constant. Computing the average value of data points over a specified time period in a regular dataset is a relatively well-understood query to compose. But for those who don't have regularly sampled data, getting a representative average over a period of time can be a complex and time-consuming query to write. &lt;strong&gt;Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time-series data comes at you fast, sometimes generating millions of data points per second (&lt;a href="https://blog.timescale.com/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=time-series-blog" rel="noopener noreferrer"&gt;read more about time-series data&lt;/a&gt;). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built &lt;a href="https://blog.timescale.com/blog/timescaledb-2-0-a-multi-node-petabyte-scale-completely-free-relational-database-for-time-series/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=tsdb-2-0-blog" rel="noopener noreferrer"&gt;TimescaleDB, a multi-node, petabyte-scale, completely free relational database for time-series&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Irregularly sampled time-series data just adds another level of complexity – and is more common than you may think. For example, irregularly sampled data, and thus the need for time-weighted averages, frequently occurs in: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Industrial IoT&lt;/strong&gt;, where teams “compress” data by only sending points when the value changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote sensing&lt;/strong&gt;, where sending data back from the edge can be costly, so you only send high-frequency data for the most critical operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger-based systems&lt;/strong&gt;, where the sampling rate of one sensor is affected by the reading of another (i.e., a security system that sends data more frequently when a motion sensor is triggered)&lt;/li&gt;
&lt;li&gt;...and many, many more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At Timescale, we’re always looking for ways to make developers’ lives easier, especially when they’re working with time-series data. To this end, &lt;a href="https://blog.timescale.com/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=hyperfunctions-blog" rel="noopener noreferrer"&gt;we introduced hyperfunctions&lt;/a&gt;, new SQL functions that simplify working with time-series data in PostgreSQL. &lt;strong&gt;One of these hyperfunctions enables you to &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=time-weighted-avg-docs" rel="noopener noreferrer"&gt;compute time-weighted averages&lt;/a&gt; quickly and efficiently&lt;/strong&gt;, so you gain hours of productivity. &lt;/p&gt;

&lt;p&gt;Read on for examples of time-weighted averages, how they’re calculated, how to use the time-weighted averages hyperfunctions in TimescaleDB, and some ideas for how you can use them to get a productivity boost for your projects, no matter the domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you’d like to get started with the &lt;code&gt;time_weight&lt;/code&gt; hyperfunction - and many more - right away, spin up a fully managed TimescaleDB service&lt;/strong&gt;: create an account to &lt;a href="https://console.forge.timescale.com/signup/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=signup" rel="noopener noreferrer"&gt;try it for free&lt;/a&gt; for 30 days. Hyperfunctions are pre-loaded on each new database service on Timescale Forge, so after you create a new service, you’re all set to use them!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you prefer to manage your own database instances, you can &lt;a href="https://github.com/timescale/timescaledb-toolkit/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=github-toolkit" rel="noopener noreferrer"&gt;download and install the timescaledb_toolkit extension&lt;/a&gt;&lt;/strong&gt; on GitHub, after which you’ll be able to use &lt;code&gt;time_weight&lt;/code&gt; and other hyperfunctions.&lt;/p&gt;

&lt;p&gt;Finally, we love building in public and continually improving: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you have questions or comments on this blog post, &lt;a href="https://github.com/timescale/timescaledb-toolkit/discussions/185/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=github-issue-185" rel="noopener noreferrer"&gt;we’ve started a discussion on our GitHub page, and we’d love to hear from you&lt;/a&gt;. (And, if you like what you see, GitHub ⭐ are always welcome and appreciated too!)&lt;/li&gt;
&lt;li&gt;You can view our &lt;a href="https://github.com/timescale/timescaledb-toolkit/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=github-toolkit" rel="noopener noreferrer"&gt;upcoming roadmap on GitHub&lt;/a&gt; for a list of proposed features, as well as features we’re currently implementing and those that are available to use today.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What are time-weighted averages?
&lt;/h2&gt;

&lt;p&gt;I’ve been a developer at Timescale for over 3 years and worked in databases for about 5 years, but I was an electrochemist before that. As an electrochemist, I worked for a battery manufacturer and saw a lot of charts like these: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh6.googleusercontent.com%2Fg-z2z19BotekB_mzML194picBe7KrJzXlogwVaO89mxM1tirT6KDbrKNwgiCDFKQ09Q4b1ICc1eo7vnIuVUIfFN4Oulayypa4U-pR-Y9UAH9CAFIdftMWvRlma2vWl77IqsKyPm5" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh6.googleusercontent.com%2Fg-z2z19BotekB_mzML194picBe7KrJzXlogwVaO89mxM1tirT6KDbrKNwgiCDFKQ09Q4b1ICc1eo7vnIuVUIfFN4Oulayypa4U-pR-Y9UAH9CAFIdftMWvRlma2vWl77IqsKyPm5" alt="Battery discharge curve showing cell voltage on the y-axis and capacity in amp-hours on the x-axis. The curve starts high, decreases relatively rapidly through the exponential zone, then stays relatively constant for a long period through the nominal zone, after which the voltage drops quite rapidly as it reaches its fully discharged state." width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
Example battery discharge curve, which describes how long a battery can power something. (Also a prime example of where time-weighted averages are 💯 necessary) Source: &lt;a href="https://www.nrel.gov/docs/fy17osti/67809.pdf" rel="noopener noreferrer"&gt;https://www.nrel.gov/docs/fy17osti/67809.pdf&lt;/a&gt;



&lt;p&gt;&lt;br&gt;&lt;br&gt;
That’s a battery discharge curve, which describes how long a battery can power something. The x-axis shows capacity in Amp-hours, and since this is a constant current discharge, the x-axis is really just a proxy for time. The y-axis displays voltage, which determines the battery’s power output; as you continue to discharge the battery, the voltage drops until it gets to a point where it needs to be recharged. &lt;/p&gt;

&lt;p&gt;When we’d do R&amp;amp;D for new battery formulations, we’d cycle many batteries many times to figure out which formulations make batteries last the longest. &lt;/p&gt;

&lt;p&gt;If you look more closely at the discharge curve, you’ll notice that there are only two “interesting” sections:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh6.googleusercontent.com%2FsoW4o8ltBce2Y3rOIgAtZrXjiwSELKXZFdbsldW0_YcovNLUsb7ZHqF3N60J2FP-dTmsuNfp_TXYfinKzgJ7rQzd6vCLUWYFK_zP3o1YP8iQYFur_Uvmqxn6l9yznAxpPaXskg97" alt="The same battery discharge curve as in the previous image but with the “interesting bits” circled, namely where the voltage decreases rapidly at the beginning and the end of the discharge curve." width="800" height="400"&gt;Example battery discharge curve, calling out the “interesting bits” (the points in time where data changes rapidly)



&lt;p&gt;&lt;br&gt;&lt;br&gt;
These are the parts at the beginning and end of the discharge where the voltage changes rapidly. Between these two sections, there’s that long period in the middle, where the voltage hardly changes at all:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh6.googleusercontent.com%2Fk-EPSQ86LqI4j6wggTjxnXqzO4R9VZNzOeMSRItnBoLjDaHS6iHmXL9qor0V85Jw9pIBWD1TrEKtIJW_y0hBXqsE32erCaB-QDjxj5chQNXIp1scSeTwvFYdhg2cpm95Sy-LwL_w" alt="The same battery discharge curve again, except now the “boring” part of the curve is highlighted, which is the middle section where the voltage hardly changes." width="800" height="400"&gt;Example battery discharge curve, calling out the “boring bits” (the points in time where the data remains fairly constant)



&lt;p&gt;&lt;br&gt;&lt;br&gt;
Now, when I said before that I was an electrochemist, I will admit that I was exaggerating a little bit. I knew enough about electrochemistry to be dangerous, but I worked with folks with PhDs who knew a lot more than I did. &lt;/p&gt;

&lt;p&gt;But, I was often better than them at working with data, so I’d do things like programming the &lt;a href="https://en.wikipedia.org/wiki/Potentiostat" rel="noopener noreferrer"&gt;potentiostat&lt;/a&gt;, the piece of equipment you hook the battery up to in order to perform these tests. &lt;/p&gt;

&lt;p&gt;For the interesting parts of the discharge cycle (those parts at the start and end), we could have the potentiostat sample at its max rate, usually a point every 10 milliseconds or so. We didn’t want to sample as many data points during the long, boring parts where the voltage didn’t change because it would mean saving lots of data with unchanging values and wasting storage.&lt;/p&gt;

&lt;p&gt;To reduce the boring data we’d have to deal with without losing the interesting bits, we’d set up the program to sample every 3 minutes, or when the voltage changed by a reasonable amount, say more than 5 mV. &lt;/p&gt;
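&lt;p&gt;That sampling rule is simple enough to sketch in a few lines of Python (the 3-minute interval and 5 mV threshold come straight from the protocol above; the function names are made up for illustration):&lt;/p&gt;

```python
def keep_point(t, v, last_t, last_v, max_interval=180.0, threshold=0.005):
    """Decide whether to record a sample: either enough time has passed
    (3 minutes) or the voltage moved by more than the 5 mV threshold."""
    return (t - last_t) >= max_interval or abs(v - last_v) > threshold

def downsample(readings):
    """Filter a stream of (t_seconds, volts) readings down to kept points."""
    kept = [readings[0]]
    for t, v in readings[1:]:
        if keep_point(t, v, kept[-1][0], kept[-1][1]):
            kept.append((t, v))
    return kept
```

&lt;p&gt;Run over a flat stretch of the curve, this keeps roughly one point every three minutes; run over the fast-changing ends, the 5 mV rule fires and the sampling density goes way up.&lt;/p&gt;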

&lt;p&gt;In practice, what would happen is something like this: &lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2F93ByMmTk0hl4bWvxwFcmx-eteEk6JEKNDFVuzR5z5rtX2q-Y0RAWqQ92cIPsDCYSrm4Y2g1dWsZW9UPBjQ_lVjV567eQTEH-TLB5Ny7eEs9QTafrU1mNJKq3yyZpqKS5zm1lhyNQ" alt="The same battery discharge curve again, this time with data points superimposed on the image. The data points are spaced close together in the “interesting bits,” where the voltage changes quickly at the beginning and end of the discharge curve. The data points are spaced further apart during the “boring” part in the middle, where the voltage hardly changes at all." width="800" height="400"&gt;Example battery discharge curve with data points superimposed to depict rapid sampling during the interesting bits and slower sampling during the boring bits.



&lt;p&gt;&lt;br&gt;&lt;br&gt;
By sampling the data in this way, we'd get more data during the interesting parts and less data during the boring middle section. That’s great! &lt;/p&gt;

&lt;p&gt;It let us answer more interesting questions about the quickly changing parts of the curve and gave us all the information we needed about the slowly changing sections – without storing gobs of redundant data. &lt;strong&gt;But, here’s a question: given this dataset, how do we find the average voltage during the discharge?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That question is important because it was one of the things we could compare between this discharge curve and future ones, say 10 or 100 cycles later. As a battery ages, its average voltage drops, and how much it dropped over time could tell us how well the battery’s storage capacity held up during its lifecycle – and if it could turn into a useful product. &lt;/p&gt;

&lt;p&gt;The problem is that the data in the interesting bits is sampled more frequently (i.e., there are more data points for the interesting bits), which would give it more weight when calculating the average, even though it shouldn't.&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2F9ScWi6ABrcZG6Ej1dGa7CH3S2UuDJZHscUwlnmL9j4q-RKMWLjpxjx9GHSr7PjFfnQJEY4zyI4ib9vxyBhbMhwgv9Jwe7nnCKNxszi5fUZ5t1qE0XE6Q3NvLDpmPHAoJLdsxJ3GU" alt="The same battery curve again, with the same data points superimposed and the “interesting bits” circled again, however this time noting that the “interesting bits” shouldn’t count extra even though there are more data points included in the circled area." width="800" height="400"&gt;Example battery discharge curve, with illustrative data points to show that while we collect more data during the interesting bits, they shouldn’t count “extra.”




&lt;p&gt;&lt;br&gt;&lt;br&gt;
If we just took a naive average over the whole curve, adding the value at each point and dividing by the number of points, it would mean that a change to our sampling rate could change our calculated average...even though the underlying effect was really the same! &lt;/p&gt;

&lt;p&gt;We could easily overlook any of the differences we were trying to identify – and any clues about how we could improve the batteries could just get lost in the variation of our sampling protocol. &lt;/p&gt;
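&lt;p&gt;To make that concrete, here’s a small Python sketch with synthetic numbers (not real battery data) where oversampling the fast-changing region shifts the naive mean but leaves the time-weighted average alone:&lt;/p&gt;

```python
def v(t):
    # Synthetic "discharge": flat at 10 V for 9 s, then a fast linear
    # drop to 2 V at t = 10 s (all values hypothetical).
    return 10.0 - 8.0 * max(t - 9.0, 0.0)

def naive_avg(pts):
    return sum(val for _, val in pts) / len(pts)

def time_weighted_avg(pts):
    # Trapezoidal area under the curve, divided by the elapsed time
    area = sum((t1 - t0) * (v0 + v1) / 2
               for (t0, v0), (t1, v1) in zip(pts, pts[1:]))
    return area / (pts[-1][0] - pts[0][0])

sparse = [(t, v(t)) for t in (0, 3, 6, 9, 10)]
dense = [(t, v(t)) for t in (0, 3, 6, 9, 9.2, 9.4, 9.6, 9.8, 10)]

# The naive mean shifts just because the drop is sampled more densely
# (8.4 V sparse vs. about 7.33 V dense), while the time-weighted
# average is 9.6 V either way.
assert round(time_weighted_avg(sparse), 9) == round(time_weighted_avg(dense), 9) == 9.6
```

&lt;p&gt;Same underlying signal, different sampling protocol, different naive average: exactly the failure mode described above.&lt;/p&gt;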

&lt;p&gt;Now, some people will say: well, why not just sample at the max rate of the potentiostat, even during the boring parts? Well, these discharge tests ran really long. They’d take 10 to 12 hours to complete, but the interesting bits could be pretty short, lasting from seconds to minutes. If we sampled at the highest rate, one point every 10 ms or so, it would mean orders of magnitude more data to store even though we would hardly use any of it! And orders of magnitude more data would mean more cost, more time for analysis, all sorts of problems.&lt;/p&gt;

&lt;p&gt;So the big question is: &lt;strong&gt;how do we get a representative average when we’re working with irregularly spaced data points?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s get theoretical for a moment here:&lt;/p&gt;

&lt;p&gt;(This next bit is a little equation-heavy, but I think they’re relatively simple equations, and they map very well onto their graphical representation. I always like it when folks give me the math and graphical intuition behind the calculations – but if you want to skip ahead to just see how time-weighted average is used, the mathy bits end here.)&lt;/p&gt;


&lt;h2&gt;
  
  
  Mathy Bits: How to derive a time-weighted average
&lt;/h2&gt;

&lt;p&gt;Let’s say we have some points like this:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh5.googleusercontent.com%2FPTO8f0OV-QODjgzJYjdM2_6qCVCqmfTu3VZFRIGPq5ce7NqdnBlQymtsj6jW3-2GcJSLJaw4ZxrZJh-mHXPT6uzzGr8UQodTSpeqO-osnvym4A1XI_quxKfAzCP2oe_GltM4Gi3E" alt="A graph showing value on the y-axis and time on the x-axis. There are four points:  open parens t 1 comma v 1 close parens to open parens t 4 comma  v 4 close parens spaced unevenly in time on the graph." width="800" height="400"&gt;A theoretical, irregularly sampled time-series dataset



&lt;p&gt;&lt;br&gt;&lt;br&gt;
Then, the normal average would be the sum of the values, divided by the total number of points:&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;avg=(v1+v2+v3+v4)4
 avg = \frac{(v_1 + v_2 + v_3 + v_4)}{4}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;vg&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord 
mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose 
nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;But, because they’re irregularly spaced, we need some way to account for that. &lt;/p&gt;

&lt;p&gt;One way to think about it would be to get a value at every point in time, and then divide it by the total amount of time. This would be like getting the total area under the curve and dividing by the total amount of time ΔT. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh5.googleusercontent.com%2FGJCgIqjl7ntHqKzmXGu2OUAkecSKTO0SHyeUOx1Cb4VnZZojq8h0jiHurtdEq2l50bkzICA9YRaXrXM7Ay8o-9o9985VKDzJpHde8T5BCDhrAh05F4ysy8SaV6xGSH0iIsApPGgT" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh5.googleusercontent.com%2FGJCgIqjl7ntHqKzmXGu2OUAkecSKTO0SHyeUOx1Cb4VnZZojq8h0jiHurtdEq2l50bkzICA9YRaXrXM7Ay8o-9o9985VKDzJpHde8T5BCDhrAh05F4ysy8SaV6xGSH0iIsApPGgT" alt="The same graph as above but with the area under the curve shaded in gray. The area under the curve is drawn by drawing a line through each pair of points and then shading down to the x-axis. The total time spanned by the points from t 1 to t 4 is denoted as Delta T." width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
The area under an irregularly sampled time-series dataset




&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;better_avg=area_under_curveΔT
better\_avg = \frac{area\_under\_curve}{\Delta T}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;tt&lt;/span&gt;&lt;span class="mord mathnormal"&gt;er&lt;/span&gt;&lt;span class="mord"&gt;_&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;vg&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord mathnormal"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;re&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord"&gt;_&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mord mathnormal"&gt;er&lt;/span&gt;&lt;span class="mord"&gt;_&lt;/span&gt;&lt;span class="mord mathnormal"&gt;c&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;(In this case, we’re doing a linear interpolation between the points.) So, let’s focus on finding that area. The area between the first two points is a trapezoid: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2F3dYoqjN0mxcj3ZOtSF8AwwM35d_zKjeArrmY4qdrkyo8Iu5TEhbX6HdllPj0jYf64QrAN5av_gHrFOzr-ODAsIWEipwlKLKo_4UK0ZrFRgevF_-cj68ttSSyvOUfpFM7Tep1YA5L" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2F3dYoqjN0mxcj3ZOtSF8AwwM35d_zKjeArrmY4qdrkyo8Iu5TEhbX6HdllPj0jYf64QrAN5av_gHrFOzr-ODAsIWEipwlKLKo_4UK0ZrFRgevF_-cj68ttSSyvOUfpFM7Tep1YA5L" alt="The same graph as above, except there is a trapezoid shaded in blue bounded on top by the line connecting the first two points and vertical lines connecting the points to the x-axis. The distance between the two points on the x-axis is denoted delta t 1." width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
A trapezoid representing the area under the first two points



&lt;p&gt;&lt;br&gt;&lt;br&gt;
Which is really a rectangle plus a triangle: &lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh6.googleusercontent.com%2FUIL-A8wJdaniVdeMxzV-GWGeJsQi4lxCVuLyQ1hgNOr9aYjGJnMfM90kQ1Qm6yQO4ROIAr6kK4LwoUrLvRncUNEDlkU-FO9YlnQUtb425mF_wwbR-rlrmV98VRhupFUJVomonAGc" alt="The same graph as the previous, except now the trapezoid, has been divided into a rectangle and a triangle. The rectangle is the height of the first point v 1. The triangle is a right triangle with the line connecting the first two points as the hypotenuse. The distance on the y-axis between the first two points is denoted as delta v 1. " width="800" height="400"&gt;That same trapezoid broken down into a rectangle and a triangle.



&lt;p&gt;&lt;br&gt;&lt;br&gt;
Okay, let's calculate that area:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;area=Δt1v1+Δt1Δv12
area = \Delta t_1 v_1 + \frac{\Delta t_1 \Delta v_1}{2}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;re&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t 
vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;
So just to be clear, that's:&lt;br&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;area=Δt1v1⏞area of rectangle+Δt1Δv12⏞area of triangle
area = \overbrace{\Delta t_1 v_1}^{\text{area of rectangle}} + \overbrace{\frac{\Delta t_1 \Delta v_1}{2}}^{\text{area of triangle}} 
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;re&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mover"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mover"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="stretchy"&gt;&lt;span class="brace-left"&gt;&lt;/span&gt;&lt;span class="brace-center"&gt;&lt;/span&gt;&lt;span class="brace-right"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;area of rectangle&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mover"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mover"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span 
class="mord"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="stretchy"&gt;&lt;span class="brace-left"&gt;&lt;/span&gt;&lt;span class="brace-center"&gt;&lt;/span&gt;&lt;span 
class="brace-right"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;area of triangle&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
 &lt;br&gt;
Okay. So now if we notice that: &lt;br&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Δv1=v2−v1
\Delta v_1 = v_2 - v_1
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 
mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;
We can simplify this equation pretty nicely. Start with:&lt;br&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Δt1v1+Δt1(v2−v1)2
\Delta t_1 v_1 + \frac{\Delta t_1 (v_2 - v_1)}{2}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;
Factor out 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Δt12
\frac{\Delta t_1}{2} 
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;Δ&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 to get:&lt;br&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Δt12(2v1+(v2−v1))
\frac{\Delta t_1}{2} (2v_1 + (v_2 - v_1))
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord 
mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;))&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;
Simplify:&lt;br&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Δt12(v1+v2)
\frac{\Delta t_1}{2} (v_1 + v_2)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;
One cool thing to note is that this gives us a new way to think about this solution: it's the average of each pair of adjacent values, weighted by the time between them:&lt;br&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;area=(v1+v2)2⏞average of v1 and v2Δt1
area = \overbrace{\frac{(v_1 + v_2)}{2}}^{\text{average of    } v_1 \text{ and } v_2} \Delta t_1 
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;re&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mover"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mover"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="stretchy"&gt;&lt;span class="brace-left"&gt;&lt;/span&gt;&lt;span class="brace-center"&gt;&lt;/span&gt;&lt;span class="brace-right"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;average of &lt;/span&gt;&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt; and &lt;/span&gt;&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;It’s also equal to the area of the rectangle drawn to the midpoint between v1 and v2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh5.googleusercontent.com%2F-yQ8VWjY_eVgIibISFjezgjxsx59bVGJCP2iX91n29nkpu3aFD1nxfuVjx81azGODw4_nkj0ELzm38k7PLNyWjMm68hgc4wXbLRpRZ9AoAu1v1HS8cm8mmftaCtU8Vfpza1egiJM" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh5.googleusercontent.com%2F-yQ8VWjY_eVgIibISFjezgjxsx59bVGJCP2iX91n29nkpu3aFD1nxfuVjx81azGODw4_nkj0ELzm38k7PLNyWjMm68hgc4wXbLRpRZ9AoAu1v1HS8cm8mmftaCtU8Vfpza1egiJM" alt="The same graph as the previous, except that now there is a rectangle imposed on the trapezoid. The rectangle is the same width as the others and goes to a height of v 1 plus v 2 over 2. " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
The area of the trapezoid is the same as the area of the rectangle drawn to the midpoint between the two points.



&lt;p&gt;Now that we’ve derived the formula for two adjacent points, we can repeat this for every pair of adjacent points in the dataset. Summing all of those areas gives us the time-weighted sum, which is equal to the area under the curve. (Folks who have studied calculus may remember some of this from learning about integrals and integral approximations!)&lt;/p&gt;

&lt;p&gt;With the total area under the curve calculated, all we have to do is divide the time-weighted sum by the overall ΔT, and we have our time-weighted average. 💥&lt;/p&gt;
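&lt;p&gt;As a quick illustration of the formula (a minimal sketch in plain Python rather than SQL, using made-up sample points), the trapezoidal time-weighted average can be computed like this:&lt;/p&gt;

```python
def time_weighted_average(points):
    """Compute the time-weighted average of sorted (time, value) points.

    Each adjacent pair contributes a trapezoid with area
    (v1 + v2) / 2 * delta_t; the sum of those areas is the
    time-weighted sum, which we divide by the overall elapsed time.
    """
    weighted_sum = sum(
        (v1 + v2) / 2 * (t2 - t1)
        for (t1, v1), (t2, v2) in zip(points, points[1:])
    )
    return weighted_sum / (points[-1][0] - points[0][0])

# Two readings a minute apart: the trapezoid rule lands on the midpoint.
print(time_weighted_average([(0, 10.0), (60, 20.0)]))  # 15.0
```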

&lt;p&gt;Now that we've worked through our time-weighted average in theory, let’s test it out in SQL. &lt;/p&gt;


&lt;h2&gt;
  
  
  How to compute time-weighted averages in SQL &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s consider the scenario of an ice cream manufacturer or shop owner who is monitoring their freezers. It turns out that ice cream needs to stay in a relatively narrow range of temperatures (~0-10℉)&lt;sup id="fnref1"&gt;1&lt;/sup&gt; so that it doesn’t melt and re-freeze, causing those weird crystals that no one likes. Similarly, if ice cream gets too cold, it’s too hard to scoop. &lt;/p&gt;

&lt;p&gt;The air temperature in the freezer will vary a bit more dramatically as folks open and close the door, but the ice cream temperature takes longer to change. Thus, problems (melting, pesky ice crystals) will only happen if the ice cream is exposed to extreme temperatures for a prolonged period. By measuring this data, the ice cream manufacturer can impose quality controls on each batch of product they’re storing in the freezer.&lt;/p&gt;

&lt;p&gt;Taking this into account, the sensors in the freezer measure temperature in the following way: when the door is closed and we’re in the optimal range, the sensors take a measurement every 5 minutes; when the door is opened, the sensors take a measurement every 30 seconds until the door is closed, and the temperature has returned below 10℉.&lt;/p&gt;

&lt;p&gt;To model that, we might have a simple table like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;freezer_temps&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;freezer_id&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And some data like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;freezer_temps&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:00:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:05:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:10:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:15:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:20:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:25:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:30:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:31:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;-- door opened!&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:31:30+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:32:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:32:30+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;-- door closed&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:33:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:33:30+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:34:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:34:30+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:35:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:35:30+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:36:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;-- temperature stabilized&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:40:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01 00:45:00+00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The period after the door opens, minutes 31-36, has a lot more data points. If we were to take the average of all the points, we would get a misleading value. The freezer was only above the threshold temperature for 5 out of 45 minutes (11% of the time period), but those minutes make up 10 out of 20 data points (50%!) because we sample freezer temperature more frequently after the door is opened. &lt;/p&gt;

&lt;p&gt;To find the more accurate, time-weighted average temperature, let’s translate the formula we derived above into SQL. We’ll also compute the normal average for comparison’s sake. (Don’t worry if you have trouble reading it; we’ll write a much simpler version later.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;setup&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;freezer_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;prev_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="k"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'epoch'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ts_e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="k"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'epoch'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;freezer_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;prev_ts_e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="o"&gt;*&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt;  &lt;span class="n"&gt;freezer_temps&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="n"&gt;nextstep&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;prev_temp&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; 
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev_temp&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts_e&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;prev_ts_e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;weighted_sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="o"&gt;*&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;freezer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;-- the regular average&lt;/span&gt;
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weighted_sum&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts_e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts_e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;time_weighted_average&lt;/span&gt; &lt;span class="c1"&gt;-- our derived average&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;nextstep&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;freezer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It does return what we want, and gives us a much better picture of what happened, but it’s not exactly fun to write, is it? &lt;/p&gt;
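&lt;p&gt;As a sanity check outside the database (a small Python sketch, with the readings transcribed by hand from the &lt;code&gt;INSERT&lt;/code&gt; statement above), we can replay the freezer data through the trapezoid formula and confirm both numbers:&lt;/p&gt;

```python
# Freezer 1 readings as (minutes since midnight, temperature in F),
# transcribed from the INSERT statement above.
readings = [
    (0, 4.0), (5, 5.5), (10, 3.0), (15, 4.0), (20, 3.5), (25, 8.0),
    (30, 9.0), (31, 10.5), (31.5, 11.0), (32, 15.0), (32.5, 20.0),
    (33, 18.5), (33.5, 17.0), (34, 15.5), (34.5, 14.0), (35, 12.5),
    (35.5, 11.0), (36, 10.0), (40, 7.0), (45, 5.0),
]

# Plain average: every reading counts equally, so the rapid-fire
# door-open samples dominate the result.
plain_avg = sum(v for _, v in readings) / len(readings)

# Time-weighted average: each adjacent pair contributes a trapezoid,
# and the sum of the areas is divided by the total elapsed time.
weighted_sum = sum((v1 + v2) / 2 * (t2 - t1)
                   for (t1, v1), (t2, v2) in zip(readings, readings[1:]))
time_weighted = weighted_sum / (readings[-1][0] - readings[0][0])

print(round(plain_avg, 1))      # 10.2
print(round(time_weighted, 6))  # 6.636111
```

Both values match the query output above.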

&lt;p&gt;We’ve got a few window functions in there, a &lt;code&gt;CASE&lt;/code&gt; statement to deal with NULLs, and a couple of CTEs to try to make it reasonably clear what’s going on. &lt;strong&gt;This is the kind of thing that can really lead to code maintenance issues when people try to figure out what’s going on and tweak it.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Code is all about managing complexity. Writing lots of complex queries to accomplish a relatively simple task makes it much less likely that the developer who comes along next (or you in three months) will understand what’s going on, how to use it, or how to change it if they (or you!) need a different result. Or, worse, it means that the code will never get changed because people don’t quite understand what the query’s doing, and it just becomes a black box that no one wants to touch (including you). &lt;/p&gt;




&lt;h2&gt;
  
  
  TimescaleDB hyperfunctions to the rescue!
&lt;/h2&gt;

&lt;p&gt;This is why we created &lt;strong&gt;&lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=hyperfunctions-docs" rel="noopener noreferrer"&gt;hyperfunctions&lt;/a&gt;&lt;/strong&gt;, to make complicated time-series data analysis less complex. Let’s look at what the time-weighted average freezer temperature query looks like if we use the &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=time-weighted-avg-docs" rel="noopener noreferrer"&gt;hyperfunctions for computing time-weighted averages&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;freezer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
    &lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Linear'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;time_weighted_average&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;freezer_temps&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;freezer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
 freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Isn’t that so much more concise?! Calculate a &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/time_weight/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=time-weight-docs" rel="noopener noreferrer"&gt;&lt;code&gt;time_weight&lt;/code&gt;&lt;/a&gt; with a &lt;code&gt;'Linear'&lt;/code&gt; weighting method (that’s the kind of weighting we derived above&lt;sup id="fnref2"&gt;2&lt;/sup&gt;), then take the average of the weighted values, and we’re done. I like that API much better (and I’d better, because I designed it!). &lt;/p&gt;

&lt;p&gt;What’s more, not only do we save ourselves from writing all that SQL, but it also becomes far, far easier to &lt;strong&gt;compose&lt;/strong&gt; (build up more complex analyses over top of the time-weighted average). This is a huge part of the design philosophy behind hyperfunctions; we want to make fundamental things simple so that you can easily use them to build more complex, application-specific analyses.&lt;/p&gt;

&lt;p&gt;Let’s imagine we’re not satisfied with the average over our entire dataset, and we want to get the time-weighted average for every 10-minute bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'10 mins'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;freezer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
    &lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Linear'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;time_weighted_average&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;freezer_temps&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;freezer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We added a &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=time-bucket-docs" rel="noopener noreferrer"&gt;&lt;code&gt;time_bucket&lt;/code&gt;&lt;/a&gt;, grouped by it, and done! Let’s look at some other kinds of sophisticated analysis that hyperfunctions enable.&lt;/p&gt;

&lt;p&gt;Continuing with our ice cream example, let’s say that we’ve set our threshold because we know that if the ice cream spends more than 15 minutes above 15 ℉, it’ll develop those ice crystals&lt;sup&gt;1&lt;/sup&gt; that make it all sandy/grainy tasting. We can use the time-weighted average in a &lt;a href="https://www.postgresql.org/docs/current/functions-window.html" rel="noopener noreferrer"&gt;window function&lt;/a&gt; to see if that happened:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Linear'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="n"&gt;fifteen_min&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rolling_twa&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;freezer_temps&lt;/span&gt;
&lt;span class="k"&gt;WINDOW&lt;/span&gt; &lt;span class="n"&gt;fifteen_min&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;freezer_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;RANGE&lt;/span&gt;  &lt;span class="s1"&gt;'15 minutes'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;freezer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; freezer_id |           ts           | temperature |    rolling_twa     
------------+------------------------+-------------+--------------------
          1 | 2020-01-01 00:00:00+00 |           4 |                   
          1 | 2020-01-01 00:05:00+00 |         5.5 |               4.75
          1 | 2020-01-01 00:10:00+00 |           3 |                4.5
          1 | 2020-01-01 00:15:00+00 |           4 |  4.166666666666667
          1 | 2020-01-01 00:20:00+00 |         3.5 | 3.8333333333333335
          1 | 2020-01-01 00:25:00+00 |           8 |  4.333333333333333
          1 | 2020-01-01 00:30:00+00 |           9 |                  6
          1 | 2020-01-01 00:31:00+00 |        10.5 |  7.363636363636363
          1 | 2020-01-01 00:31:30+00 |          11 |  7.510869565217392
          1 | 2020-01-01 00:32:00+00 |          15 |  7.739583333333333
          1 | 2020-01-01 00:32:30+00 |          20 |               8.13
          1 | 2020-01-01 00:33:00+00 |        18.5 |  8.557692307692308
          1 | 2020-01-01 00:33:30+00 |          17 |  8.898148148148149
          1 | 2020-01-01 00:34:00+00 |        15.5 |  9.160714285714286
          1 | 2020-01-01 00:34:30+00 |          14 |   9.35344827586207
          1 | 2020-01-01 00:35:00+00 |        12.5 |  9.483333333333333
          1 | 2020-01-01 00:35:30+00 |          11 | 11.369047619047619
          1 | 2020-01-01 00:36:00+00 |          10 | 11.329545454545455
          1 | 2020-01-01 00:40:00+00 |           7 |             10.575
          1 | 2020-01-01 00:45:00+00 |           5 |  9.741666666666667
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The window here is over the previous 15 minutes, ordered by time. And it looks like we stayed below our ice-crystallization temperature!&lt;/p&gt;
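&lt;p&gt;The window semantics can be mimicked outside SQL, too. Here’s a rough Python emulation (hand-rolled, with the readings transcribed from the &lt;code&gt;INSERT&lt;/code&gt; above) of the 15-minute &lt;code&gt;RANGE ... PRECEDING&lt;/code&gt; frame: for each row, it time-weight-averages only the points inside the trailing window, and a single-point window yields &lt;code&gt;None&lt;/code&gt; where SQL returns NULL:&lt;/p&gt;

```python
readings = [
    (0, 4.0), (5, 5.5), (10, 3.0), (15, 4.0), (20, 3.5), (25, 8.0),
    (30, 9.0), (31, 10.5), (31.5, 11.0), (32, 15.0), (32.5, 20.0),
    (33, 18.5), (33.5, 17.0), (34, 15.5), (34.5, 14.0), (35, 12.5),
    (35.5, 11.0), (36, 10.0), (40, 7.0), (45, 5.0),
]

def rolling_twa(points, window=15):
    """For each point, compute the trapezoidal time-weighted average
    over the points in the trailing `window` minutes, inclusive of
    the current point (mirroring RANGE '15 minutes' PRECEDING)."""
    out = []
    for i, (t, _) in enumerate(points):
        visible = [(u, v) for u, v in points[:i + 1] if t - window <= u]
        elapsed = visible[-1][0] - visible[0][0]
        if elapsed == 0:      # single point: no elapsed time,
            out.append(None)  # the SQL version returns NULL here
            continue
        ws = sum((v1 + v2) / 2 * (u2 - u1)
                 for (u1, v1), (u2, v2) in zip(visible, visible[1:]))
        out.append(ws / elapsed)
    return out

twa = rolling_twa(readings)
print(twa[1])                                       # 4.75
print(max(v for v in twa if v is not None) < 15.0)  # True
```

The peak of this rolling average matches the 11.37 ℉ high point in the table above, confirming we stayed under the 15 ℉ threshold.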

&lt;p&gt;We also provide a special &lt;a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/rollup-timeweight/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=rolllup-docs" rel="noopener noreferrer"&gt;&lt;code&gt;rollup&lt;/code&gt;&lt;/a&gt; function so you can re-aggregate time-weighted values from subqueries. For instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;rollup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_weight&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;time_weighted_average&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'10 mins'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;freezer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;time_weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Linear'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;freezer_temps&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;freezer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;time_weighted_average 
-----------------------
    6.636111111111111
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us the same result as our original, un-bucketed time-weighted average because we’re just re-aggregating the bucketed values.&lt;/p&gt;
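&lt;p&gt;One way to picture why re-aggregation works (a simplified Python sketch, not TimescaleDB’s actual internal representation): each bucket’s partial aggregate keeps its weighted sum plus its first and last points, and rolling up two adjacent buckets just adds one bridging trapezoid across the gap between them:&lt;/p&gt;

```python
from functools import reduce

def summarize(points):
    """Per-bucket partial aggregate: weighted sum plus endpoints."""
    ws = sum((v1 + v2) / 2 * (t2 - t1)
             for (t1, v1), (t2, v2) in zip(points, points[1:]))
    return {"ws": ws, "first": points[0], "last": points[-1]}

def rollup(a, b):
    """Merge summary b (later in time) into a: add both weighted sums
    plus the trapezoid bridging a's last point and b's first point."""
    (t1, v1), (t2, v2) = a["last"], b["first"]
    return {"ws": a["ws"] + (v1 + v2) / 2 * (t2 - t1) + b["ws"],
            "first": a["first"], "last": b["last"]}

def average(summary):
    return summary["ws"] / (summary["last"][0] - summary["first"][0])

readings = [
    (0, 4.0), (5, 5.5), (10, 3.0), (15, 4.0), (20, 3.5), (25, 8.0),
    (30, 9.0), (31, 10.5), (31.5, 11.0), (32, 15.0), (32.5, 20.0),
    (33, 18.5), (33.5, 17.0), (34, 15.5), (34.5, 14.0), (35, 12.5),
    (35.5, 11.0), (36, 10.0), (40, 7.0), (45, 5.0),
]

# Bucket into 10-minute windows, summarize each, then roll up.
buckets = {}
for t, v in readings:
    buckets.setdefault(int(t // 10), []).append((t, v))
summaries = [summarize(pts) for _, pts in sorted(buckets.items())]
print(round(average(reduce(rollup, summaries)), 6))  # 6.636111
```

The rolled-up buckets reproduce the grand-total time-weighted average exactly.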

&lt;p&gt;But &lt;code&gt;rollup&lt;/code&gt; is mainly there so that you can do more interesting analysis, like, say, normalizing each freezer’s ten-minute time-weighted average to the overall time-weighted average.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'10 mins'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;freezer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;time_weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Linear'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;freezer_temps&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;freezer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;freezer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bucketed_twa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;rollup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_weight&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;overall_twa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;rollup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_weight&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;normalized_twa&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kind of feature (storing the time-weight for analysis later) is most useful in a &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=co-aggs-docs" rel="noopener noreferrer"&gt;continuous aggregate&lt;/a&gt;, and it just so happens that we’ve designed our time-weighted average to be usable in that context! &lt;/p&gt;
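&lt;p&gt;As a rough sketch of what that might look like (the view name and the exact query here are illustrative, not taken from our docs): you could store the &lt;code&gt;time_weight&lt;/code&gt; aggregate itself in a continuous aggregate and re-aggregate it with &lt;code&gt;rollup&lt;/code&gt; at query time, over any larger window you like.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- store the partial time_weight state per 10-minute bucket
-- (view name is hypothetical)
CREATE MATERIALIZED VIEW freezer_temps_10min
WITH (timescaledb.continuous) AS
SELECT time_bucket('10 mins'::interval, ts) as bucket,
    freezer_id,
    time_weight('Linear', ts, temperature) as tw
FROM freezer_temps
GROUP BY bucket, freezer_id;

-- re-aggregate the stored state over a larger window at query time
SELECT freezer_id,
    average(rollup(tw)) as overall_twa
FROM freezer_temps_10min
GROUP BY freezer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;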

&lt;p&gt;We’ll be going into more detail on that in a future post, so be sure to &lt;a href="https://www.timescale.com/signup/newsletter/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=newsletter" rel="noopener noreferrer"&gt;subscribe to our newsletter&lt;/a&gt; so you can get notified when we publish new technical content.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try time-weighted averages today
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you’d like to get started with the time_weight hyperfunction - and many more - right away, spin up a fully managed TimescaleDB service&lt;/strong&gt;: create an account to &lt;a href="https://console.forge.timescale.com/signup?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=signup" rel="noopener noreferrer"&gt;try it for free&lt;/a&gt; for 30 days. Hyperfunctions are pre-loaded on each new database service on Timescale Forge, so after you create a new service, you’re all set to use them!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you prefer to manage your own database instances, you can &lt;a href="https://github.com/timescale/timescaledb-toolkit/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=github-toolkit" rel="noopener noreferrer"&gt;download and install the timescaledb_toolkit extension&lt;/a&gt;&lt;/strong&gt; on GitHub, after which you’ll be able to use &lt;code&gt;time_weight&lt;/code&gt; and all other hyperfunctions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If you have questions or comments on this blog post, &lt;a href="https://github.com/timescale/timescaledb-toolkit/discussions/185/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=github-issue-185" rel="noopener noreferrer"&gt;we’ve started a discussion on our GitHub page, and we’d love to hear from you&lt;/a&gt;&lt;/strong&gt;. (And, if you like what you see, GitHub ⭐ are always welcome and appreciated too!)&lt;/li&gt;
&lt;li&gt;We love building in public, and you can view our &lt;a href="https://github.com/timescale/timescaledb-toolkit/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hyperfunctions-1-0-2021&amp;amp;utm_content=github-toolkit" rel="noopener noreferrer"&gt;upcoming roadmap on GitHub&lt;/a&gt; for a list of proposed features, features we’re currently implementing, and features available to use today.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’d like to give a special thanks to &lt;a href="https://github.com/inselbuch" rel="noopener noreferrer"&gt;@inselbuch&lt;/a&gt;, who &lt;a href="https://github.com/timescale/timescaledb-toolkit/issues/46" rel="noopener noreferrer"&gt;submitted the GitHub issue&lt;/a&gt; that got us started on this project (as well as the other folks who 👍’d it and let us know they wanted to use it).&lt;/p&gt;

&lt;p&gt;We believe time-series data is everywhere, and making sense of it is crucial for all manner of technical problems. We built hyperfunctions to make it easier for developers to harness the power of time-series data. We’re always looking for feedback on what to build next and would love to know how you’re using hyperfunctions, problems you want to solve, or things you think should - or could - be simplified to make analyzing time-series data in SQL that much better. (To contribute feedback, comment on an &lt;a href="https://github.com/timescale/timescaledb-toolkit/issues" rel="noopener noreferrer"&gt;open issue&lt;/a&gt; or in a &lt;a href="https://github.com/timescale/timescaledb-toolkit/discussions" rel="noopener noreferrer"&gt;discussion thread&lt;/a&gt; in GitHub.)&lt;/p&gt;

&lt;p&gt;Lastly, in future posts, we’ll give some more context around our design philosophy, explain the decisions we’ve made around our APIs for time-weighted averages (and other features), and detail how other hyperfunctions work. So, if that’s your bag, you’re in luck – but you’ll have to wait a week or two.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;I don’t know that these times or temperatures are accurate per se; however, the phenomenon of ice cream partially melting and refreezing causing larger ice crystals to form - and coarsening the ice cream as a result - is well documented. See, for instance, &lt;a href="https://peoplegetreadybooks.com/?q=h.tviewer&amp;amp;using_sb=status&amp;amp;qsb=keyword&amp;amp;qse=OqerFF92q0vIs_NOprdwmw" rel="noopener noreferrer"&gt;Harold McGee’s On Food And Cooking&lt;/a&gt; (p 44 in the 2004 revised edition). So, just in case you are looking for advice on storing your ice cream from a blog about time-series databases: for longer-term storage, you would likely want the ice cream to be stored below 0℉. Our example is more like a scenario you’d see in an ice cream display (e.g., in an ice cream parlor or factory line) since the ice cream is kept between 0-10℉ (ideal for scooping, because lower temperatures make ice cream too hard to scoop). ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;We also offer ’LOCF’ or last observation carried forward weighting, which is best suited to cases where you record data points whenever the value changes (i.e., the old value is valid until you get a new one.) The derivation for that is similar, except the rectangles have the height of the first value, rather than the linear weighting we’ve discussed in this post (i.e., where we do linear interpolation between adjacent data points):&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.timescale.com%2Fcontent%2Fimages%2F2021%2F07%2FLOCF-Weighting.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.timescale.com%2Fcontent%2Fimages%2F2021%2F07%2FLOCF-Weighting.png" alt="LOCF Weighting. A graph showing value on the y-axis and time on the x-axis.  There are four points:  open parens t 1 comma v 1 close parens to open parens t 4 comma  v 4 close parens spaced unevenly in time on the graph. There is a shaded area on the graph drawn as a series of rectangles. Each rectangle extends from one point to the next in the series and the rectangle is the height of the first point. So the rectangle under points 1 and 2 has the height of point 1 et cetera." width="" height=""&gt;&lt;/a&gt;LOCF weighting is useful when you know the value is constant until the following point. Rather than: &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.timescale.com%2Fcontent%2Fimages%2F2021%2F07%2FLinear-Weighting.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.timescale.com%2Fcontent%2Fimages%2F2021%2F07%2FLinear-Weighting.png" alt="Linear Weighting. 
A graph showing value on the y-axis and time on the x-axis. There are four points:  open parens t 1 comma v 1 close parens to open parens t 4 comma  v 4 close parens spaced unevenly in time on the graph. The area under the graph is shaded, much like the previous graph, except now it is a series of trapezoids and the top of each trapezoid is the line drawn between successive points. " width="" height=""&gt;&lt;/a&gt;Linear weighting is useful when you are sampling a changing value at irregular intervals. In general, linear weighting is appropriate for cases where the sampling rate is variable, but there are no guarantees provided by the system about only providing data when it changes. LOCF works best when there’s some guarantee that your system will provide data only when it changes, and you can accurately carry the old value until you receive a new one.  ↩&lt;/p&gt;
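&lt;p&gt;As a quick sketch (using the same &lt;code&gt;freezer_temps&lt;/code&gt; table as the examples above), switching between the two methods is just a matter of the first argument to &lt;code&gt;time_weight&lt;/code&gt;, so you can compute both side by side and compare:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- LOCF holds each reading constant until the next one arrives;
-- Linear interpolates between adjacent readings
SELECT freezer_id,
    average(time_weight('LOCF', ts, temperature)) as locf_twa,
    average(time_weight('Linear', ts, temperature)) as linear_twa
FROM freezer_temps
GROUP BY freezer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;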
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>datascience</category>
      <category>database</category>
      <category>postgres</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
