Watch OVER your WINDOW of opportunities

#sql #googlecloud #covid

Howdy!

From the very first time I heard of SQL Window Functions, I have always been amazed at how powerful it is. Also found that a lot of folks in my circle don't really understand where or how to use it.

Thus, I decided to write a series of articles that explain SQL Window functions, what they are, where they can be used etc.

To make the examples more fun and easily replicable, we would be answering questions about the COVID-19 pandemic by analysing data from the Google COVID-19 BigQuery Dataset.

In particular the bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide table.

Let's begin our journey by understanding some definitions.

What are SQL Window Functions

Window functions are used to perform aggregations without grouping the unaggregated data to a single row. Also, they are used to perform calculations that are relative to the current row(eg running sums/averages).

A more complete definition from the PostgreSQL documentation is seen below:

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

When should we use them

I think the best way to make sense of the definition above is to look at some use cases of window functions. They come in handy when:

Comparing aggregated data to unaggregated data.
Performing a calculation that is dependent on the values of previous rows(eg. running totals/averages, increase of a particular value with time etc.).
Ranking data.
Performing calculations that involves breaking the resultSet into section(technically called 'windows').

We'll take a deeper dive into these in subsequent articles in the series. For now, let's look at the basic structure of a query with a window function

Syntax of Window Functions

Window functions can be created in the same way we specify columns in a SQL query. The key difference is that we use the OVER clause to specify that it is a window function.


SELECT 
   _agg_func_call_ OVER([window_aggs]) AS col_name,
   other_col_1,
   other_col_2, 
   ...,
   other_col_n

FROM _table_name_
[other_select parts]

As you can see, the window function is part of the SELECT statement. SQL knows that it is a window function if it has the OVER() function.

The agg_func_call here can be any aggregate function eg AVG, MIN, MAX, SUM, etc.

Note:
One very important thing to bear in mind here is the order of operations of an SQL statement and when the Window functions are called. The order of operation is as follows

FROM and JOINs

WHERE

GROUP BY

HAVING

SELECT (this is when our window functions run)

DISTINCT

ORDER BY

LIMIT/OFFSET

Thus, our Window functions can access the ResultSet after the HAVING. As we would see in subsequent articles, whenever we have to choose between a window function and a Subquery solution, the window function would usually be faster.

This is because a subquery performs its calculation by running through the entire table while a window function performs calculation with ResultSet produced from the WHERE clause.

OVER function arguments

The OVER function optionally takes some arguments
The full syntax of the argument of the OVER() function is shown here.


OVER(
    [PARTITION BY args] 
    [ORDER BY args] 
    [{ROWS| ROWS BETWEEN} args]
)

ORDER BY col_1, ..., col_n

This has the same syntax as when used at the end of a SELECT statement.

In the window function, however, it sorts the ResultSet which would be used in the window function. We would see examples of this later on.

[ROWS rows_args| ROWS BETWEEN starting_row BETWEEN ending_row]

This is used to perform calculations relative to the current row. It comes in very handy when performing aggregations that are relative to previous/next row values eg leading sums/averages, daily rise in cases etc.

PARTITION BY

This is used to break up the data into more windows. This comes in handy in some complex calculations where we want to break up the result-set into smaller 'windows' and perform the calculations within that window.

The examples of this would explain this way better than words will.

So that's what you should expect in this series. I hope you have a fun read.

The next on the series would be Comparing aggregated data to unaggregated data. See you there!