Joash Xu

Posted on Nov 18, 2021 • Originally published at enzircle.com

List of window functions

#sql

PostgreSQL has several built-in window functions, and they provide the ability to perform calculations across rows related to the current query row. In the previous article, I laid out the basic concept of how window functions work. In this article, I will give an overview of a handful of these window functions. The windows functions are split into four categories: aggregate, ranking, distribution, and analytic functions.

The built-in window functions are listed below.

Except for functions in the aggregate column, all other functions require an OVER clause to work.

Aggregate window functions

One use case for a window function is to calculate moving average and moving total. We can achieve this using aggregate functions. Remember that aggregate functions act as window functions only when there is an OVER clause after the function. If you omit the OVER clause, they work like ordinary aggregates and return a single row for the entire set. You can use all built-in plain aggregate functions as a window function. But in this article, I only use five of them.

Functions	What it does
AVG(args)	Average (arithmetic mean) of all the non-null input values within a window frame.
COUNT(args)	Count of non-null values within a window frame.
MAX(args)	Maximum value of the non-null values within a window frame.
MIN(args)	Minimum value of the non-null values within a window frame.
SUM(args)	Sum of non-null values within a window frame.

Aggregate functions do not require an ORDER BY, but they accept window frame definition (ROWS, RANGE, and GROUPS).

The following query shows the aggregate functions in action.

SELECT customer_id AS cid,
       billing_city AS city, 
       total, 
       ROUND(AVG(total) OVER(PARTITION BY billing_city), 2) AS avg,
       COUNT(total) OVER(PARTITION BY billing_city) AS count,
       MAX(total) OVER(PARTITION BY billing_city) AS max,
       MIN(total) OVER(PARTITION BY billing_city) AS min,
       SUM(total) OVER(PARTITION BY billing_city) AS sum
FROM invoice;

 cid |   city    | total | avg  | count |  max  | min  |  sum
-----+-----------+-------+------+-------+-------+------+-------
  48 | Amsterdam |  1.98 | 5.80 |     7 | 13.86 | 0.99 | 40.62
  48 | Amsterdam | 13.86 | 5.80 |     7 | 13.86 | 0.99 | 40.62
  48 | Amsterdam |  0.99 | 5.80 |     7 | 13.86 | 0.99 | 40.62
  48 | Amsterdam |  8.91 | 5.80 |     7 | 13.86 | 0.99 | 40.62
  48 | Amsterdam |  8.94 | 5.80 |     7 | 13.86 | 0.99 | 40.62
  48 | Amsterdam |  3.96 | 5.80 |     7 | 13.86 | 0.99 | 40.62
  48 | Amsterdam |  1.98 | 5.80 |     7 | 13.86 | 0.99 | 40.62
  59 | Bangalore |  1.99 | 6.11 |     6 | 13.86 | 1.98 | 36.64
  59 | Bangalore |  8.91 | 6.11 |     6 | 13.86 | 1.98 | 36.64
  59 | Bangalore |  1.98 | 6.11 |     6 | 13.86 | 1.98 | 36.64
  59 | Bangalore | 13.86 | 6.11 |     6 | 13.86 | 1.98 | 36.64
  59 | Bangalore |  3.96 | 6.11 |     6 | 13.86 | 1.98 | 36.64
  59 | Bangalore |  5.94 | 6.11 |     6 | 13.86 | 1.98 | 36.64
 <------------------------ TRUNCATED ------------------------->

Ranking functions

Another use case for using the window function is assigning an ordinal number to rows, and PostgreSQL provides three functions to handle this.

Functions	What it does
ROW_NUMBER()	Returns the number of the current row within a window partition. The numbers are unique.
RANK()	Returns the rank of the current row within a partition. There will be gaps and tied values.
DENSE_RANK()	Returns the rank of the current row within a partition, without gaps, but there will be tied values. Think of it as counts for peer groups.

ROW_NUMBER() does not require ORDER BY, but RANK() and DENSE_RANK() require ORDER BY to work. None of these functions accept window frame definitions.

The following query show ROW_NUMBER(), RANK(), DENSE_RANK() in action.

SELECT customer_id AS cid,
       billing_city AS city, 
       total, 
       ROW_NUMBER() OVER(PARTITION BY billing_city) AS row_number,
       RANK() OVER(PARTITION BY billing_city ORDER BY total) AS rank,
       DENSE_RANK() OVER(PARTITION BY billing_city ORDER BY total) AS dense_rank
FROM invoice;

 cid | city         | total | row_number | rank | dense_rank
-----+--------------+-------+------------+------+------------
  48 | Amsterdam    |  0.99 |          1 |    1 |          1
  48 | Amsterdam    |  1.98 |          2 |    2 |          2
  48 | Amsterdam    |  1.98 |          3 |    2 |          2
  48 | Amsterdam    |  3.96 |          4 |    4 |          3
  48 | Amsterdam    |  8.91 |          5 |    5 |          4
  48 | Amsterdam    |  8.94 |          6 |    6 |          5
  48 | Amsterdam    | 13.86 |          7 |    7 |          6
  59 | Bangalore    |  1.98 |          1 |    1 |          1
  59 | Bangalore    |  1.99 |          2 |    2 |          2
  59 | Bangalore    |  3.96 |          3 |    3 |          3
  59 | Bangalore    |  5.94 |          4 |    4 |          4
  59 | Bangalore    |  8.91 |          5 |    5 |          5
  59 | Bangalore    | 13.86 |          6 |    6 |          6
 <------------------------ TRUNCATED ------------------------->

Distribution functions

Distribution functions require ORDER BY, and they do not accept window frame definitions.

Percent rank

PERCENT_RANK() returns the relative rank of the current row. It is calculated using a simple formula: (rank - 1) / (total partition rows - 1). The value ranges from 0 to 1 inclusive.

The following query shows PERCENT_RANK() in action. In this particular example, I also partition the data according to the billing city.

SELECT customer_id AS cid,
       billing_city AS city,
       total, 
       PERCENT_RANK() over(PARTITION BY billing_city ORDER BY total)
FROM invoice;

   city        | total |    percent_rank
--------------------+-------+---------------------
 Amsterdam     |  0.99 |                   0
 Amsterdam     |  1.98 | 0.16666666666666666
 Amsterdam     |  1.98 | 0.16666666666666666
 Amsterdam     |  3.96 |                 0.5
 Amsterdam     |  8.91 |  0.6666666666666666
 Amsterdam     |  8.94 |  0.8333333333333334
 Amsterdam     | 13.86 |                   1
 Bangalore     |  1.98 |                   0
 Bangalore     |  1.99 |                 0.2
 Bangalore     |  3.96 |                 0.4
 Bangalore     |  5.94 |                 0.6
 Bangalore     |  8.91 |                 0.8
 Bangalore     | 13.86 |                   1
<---------------- TRUNCATED ---------------->

Cumulative distribution

CUME_DIST() returns the cumulative distribution within a group of values. The value range from 1/N to 1, and you calculate this by dividing the number of rows with values less than or equal to the current row's value to the total number of rows.

The following query show CUME_DIST() in action.

SELECT billing_city as city,
       total, 
       CUME_DIST() over(PARTITION BY billing_city ORDER BY total)
FROM invoice;

PARTITION BY is not mandatory for CUME_DIST, but I used it to show cumulative distribution within a city in this example.

       city         | total |      cume_dist
--------------------------+-------+---------------------
 Amsterdam           |  0.99 | 0.14285714285714285
 Amsterdam           |  1.98 | 0.42857142857142855
 Amsterdam           |  1.98 | 0.42857142857142855
 Amsterdam           |  3.96 |  0.5714285714285714
 Amsterdam           |  8.91 |  0.7142857142857143
 Amsterdam           |  8.94 |  0.8571428571428571
 Amsterdam           | 13.86 |                   1
 Bangalore           |  1.98 | 0.16666666666666666
 Bangalore           |  1.99 |  0.3333333333333333
 Bangalore           |  3.96 |                 0.5
 Bangalore           |  5.94 |  0.6666666666666666
 Bangalore           |  8.91 |  0.8333333333333334
 Bangalore           | 13.86 |                   1

Analytic functions

Lead and lag

LEAD(args, offset, default) returns a value that comes after the current row by an offset. If there is no row after the offset, it will return default. If you don't specify, the offset value will be one, and the default will be NULL.

LAG(args, offset, default) returns a value that comes before the current row by an offset. If there is no row before the offset, it will return default. If you don't specify, the offset value will be one, and the default will be NULL.

The default value for both functions must be of a type that is compatible with the row value. So you might need to cast it or use :: operator to match the column data type.

Both LEAD and LAG require an ORDER BY clause, but they do not accept window frame definition.

The following query shows LEAD() and LAG() in action. Notice to set the default to 0, and I must cast the data type to match the total column data type, which is numeric.

SELECT billing_city AS city,
       invoice_date::DATE as date, 
       total,
       LEAD(total) OVER(
           PARTITION BY billing_city
           ORDER BY invoice_date         
       ) as lead,
       LAG(total, 2, 0::numeric) OVER(
           PARTITION BY billing_city
           ORDER BY invoice_date 
           RANGE BETWEEN UNBOUNDED PRECEDING 
              AND UNBOUNDED FOLLOWING
       ) as lag
FROM invoice;

 city         |    date    | total | lead  |  lag
-------------------+------------+-------+-------+-------
 Amsterdam    | 2009-05-10 |  8.91 |  1.98 |     0
 Amsterdam    | 2010-12-15 |  1.98 |  3.96 |     0
 Amsterdam    | 2011-03-19 |  3.96 |  8.94 |  8.91
 Amsterdam    | 2011-06-21 |  8.94 |  0.99 |  1.98
 Amsterdam    | 2012-02-09 |  0.99 |  1.98 |  3.96
 Amsterdam    | 2013-08-02 |  1.98 | 13.86 |  8.94
 Amsterdam    | 2013-09-12 | 13.86 |       |  0.99
 Bangalore    | 2009-04-05 |  3.96 |  5.94 |     0
 Bangalore    | 2009-07-08 |  5.94 |  1.99 |     0
 Bangalore    | 2010-02-26 |  1.99 |  1.98 |  3.96
 Bangalore    | 2011-08-20 |  1.98 | 13.86 |  5.94
 Bangalore    | 2011-09-30 | 13.86 |  8.91 |  1.99
 Bangalore    | 2012-05-30 |  8.91 |       |  1.98
<------------------ TRUNCATED -------------------->

First and last value

FIRST_VALUE() returns the value for the first row within a window frame.

LAST_VALUE() returns the value for the first row within a window frame. For the last value, you usually want to use RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING; otherwise, you will get the current row's value. This behavior is because the default window frame with an ORDER BY clause is between RANGE UNBOUNDED and CURRENT ROW.

The following query partition the data into billing city and fetch the first and last values of the window frame.

SELECT billing_city AS city,
       invoice_date::DATE as date, 
       total,
       FIRST_VALUE(total) OVER(
           PARTITION BY billing_city
           ORDER BY invoice_date         
       ) as first,
       LAST_VALUE(total) OVER(
           PARTITION BY billing_city
           ORDER BY invoice_date 
           RANGE BETWEEN UNBOUNDED PRECEDING 
              AND UNBOUNDED FOLLOWING
       ) as last
FROM invoice;

  city        |    date    | total | first | last
-------------------+------------+-------+-------+-------
 Amsterdam    | 2009-05-10 |  8.91 |  8.91 | 13.86
 Amsterdam    | 2010-12-15 |  1.98 |  8.91 | 13.86
 Amsterdam    | 2011-03-19 |  3.96 |  8.91 | 13.86
 Amsterdam    | 2011-06-21 |  8.94 |  8.91 | 13.86
 Amsterdam    | 2012-02-09 |  0.99 |  8.91 | 13.86
 Amsterdam    | 2013-08-02 |  1.98 |  8.91 | 13.86
 Amsterdam    | 2013-09-12 | 13.86 |  8.91 | 13.86
 Bangalore    | 2009-04-05 |  3.96 |  3.96 |  8.91
 Bangalore    | 2009-07-08 |  5.94 |  3.96 |  8.91
 Bangalore    | 2010-02-26 |  1.99 |  3.96 |  8.91
 Bangalore    | 2011-08-20 |  1.98 |  3.96 |  8.91
 Bangalore    | 2011-09-30 | 13.86 |  3.96 |  8.91
 Bangalore    | 2012-05-30 |  8.91 |  3.96 |  8.91
<------------------- TRUNCATED ------------------->

ntile

NTILE(n) divides the partition as equally as possible into n groups and assigns an integer ranging from 1 to n.

NTILE requires an ORDER BY clause but does not accept window frame definition.

In the following query, I partition the data by billing city and divide each partition into three groups.

SELECT billing_city AS city,
       total,
       NTILE(3) OVER(PARTITION BY billing_city)
FROM invoice;

        city         | total | ntile
--------------------------+-------+-------
 Amsterdam           |  1.98 |     1
 Amsterdam           | 13.86 |     1
 Amsterdam           |  0.99 |     1
 Amsterdam           |  8.91 |     2
 Amsterdam           |  8.94 |     2
 Amsterdam           |  3.96 |     3
 Amsterdam           |  1.98 |     3
 Bangalore           |  1.99 |     1
 Bangalore           |  8.91 |     1
 Bangalore           |  1.98 |     2
 Bangalore           | 13.86 |     2
 Bangalore           |  3.96 |     3
 Bangalore           |  5.94 |     3
<------------- TRUNCATED ----------->

nth value

NTH_VALUE(args, n) return values at the nth row of the window frame or NULL if the row does not exist. The counting starts from 1.

ORDER BY is not required, but you need to pay attention to the frame similar to the LAST_VALUE function. If you use the ORDER BY clause, you probably need to specify RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to get the correct value.

In the following query, I partition the data per billing city and sort the total descendingly to get the second largest total for each partition.

SELECT billing_city AS city,
       total,
       NTH_VALUE(total, 2) OVER(
            PARTITION BY billing_city
            ORDER BY total DESC
            RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        )
FROM invoice;

        city         | total | nth_value
--------------------------+-------+-----------
 Amsterdam           | 13.86 |      8.94
 Amsterdam           |  8.94 |      8.94
 Amsterdam           |  8.91 |      8.94
 Amsterdam           |  3.96 |      8.94
 Amsterdam           |  1.98 |      8.94
 Amsterdam           |  1.98 |      8.94
 Amsterdam           |  0.99 |      8.94
 Bangalore           | 13.86 |      8.91
 Bangalore           |  8.91 |      8.91
 Bangalore           |  5.94 |      8.91
 Bangalore           |  3.96 |      8.91
 Bangalore           |  1.99 |      8.91
 Bangalore           |  1.98 |      8.91
<--------------- TRUNCATED ------------->

Windows function parameter requirement summary

The following table summarizes which window functions require an ORDER BY clause and which functions accept window frame definitions like ROWS, RANGE, and GROUPS.

Functions	Requires ORDER BY	Accept Window Frame
avg()	No	Yes
count()	No	Yes
max()	No	Yes
min()	No	Yes
sum()	No	Yes
row_number()	No	No
rank()	Yes	No
dense_rank()	Yes	No
percent_rank()	Yes	No
cume_dist()	Yes	No
lead()	Yes	No
lag()	Yes	No
first_value()	No	Yes
last_value()	No	Yes
ntile()	Yes	No
nth_value()	No	Yes

Wrap up

In this article, I gave an overview of a handful of windows functions. The window functions I have gone through are split into four categories: aggregate, ranking, distribution, and analytics. In the following article, I will show you several real-world use cases for window functions.

DEV Community