DEV Community: Onyinyechi Ofondu

Aggregate Functions in SQL

Onyinyechi Ofondu — Sat, 27 Nov 2021 17:46:27 +0000

Aggregate functions are computations that take a set of values and return a single value summarizing them. They are used to derive descriptive statistics and provide key numbers in sectors such as health, economics, and business.
The diagram below shows the typical operation of an aggregate function on a specific column and what the result looks like.

In the diagram above we can see a dataset with two columns (column 1 & column 2). Applying the SUM function to column 2 adds up all the values in that column and returns a single value in the "result" column.
For this article, we shall be using PostgreSQL with pgAdmin 4 as the GUI. It is one of the best graphical user interface platforms for PostgreSQL and is very beginner-friendly. You can download it here for your PostgreSQL needs.

Getting back into it, the different aggregate functions are:
SUM: adds up all the values of a specified column.
MIN: the minimum value of a specified column.
MAX: the maximum value of a specified column.
AVG: the average (mean) of the values of a specified column.
COUNT: the number of rows in a table, or the number of non-NULL values in a specified column.

Aggregate functions are used in the SELECT and HAVING clauses (they are not allowed in the WHERE clause), where:
The SELECT clause specifies the columns that the SQL query will return, and
The HAVING clause specifies a search condition for a group or an aggregate.

I created a dataset of movie downloads for this article which contains certain movie names, genres, and the number of downloads. This can be created using the SQL statement below:
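As a minimal stand-in (not the author's actual statement), a dataset of this shape could be sketched as follows. The table name `movies`, the column names, and every row are assumptions invented for illustration. Python's built-in sqlite3 module is used only so the SQL runs as-is without a PostgreSQL server; the CREATE TABLE statement itself should also work in PostgreSQL.

```python
import sqlite3

# Hypothetical stand-in dataset: table name, columns and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE movies (
        name      TEXT,
        genre     TEXT,     -- left NULL for two rows
        downloads INTEGER
    )
""")
conn.executemany(
    "INSERT INTO movies (name, genre, downloads) VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", 400),
        ("Movie B", "Action", 250),
        ("Movie C", "Comedy", 200),
        ("Movie D", "Comedy", 200),
        ("Movie E", "Drama", 150),
        ("Movie F", None, 100),
        ("Movie G", None, 50),
    ],
)

# Read the table back to confirm what was stored.
rows = conn.execute(
    "SELECT name, genre, downloads FROM movies ORDER BY name"
).fetchall()
for row in rows:
    print(row)
```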

Using the dataset above, we will look at the applications of the different aggregate functions.
🔥Let's go!!

COUNT()

The COUNT function is the most straightforward function and the best to start with:

The "COUNT(*)" in line 1 above is used to count all the rows in the dataset. This gives the result in the image below:

However, when the COUNT function is used on a column, it counts only the values in that column that are not NULL:

Looking at the count_aggregate_function_2 snippet of code above and the count_aggregate_function_1 snippet before that, we can see that the only difference is that the COUNT function now takes the "genre" column instead of "*", which stands for every row in the table.
The result as seen in the image below is not the same as the COUNT for the entire table because that column contains two NULL values:
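The difference between the two forms of COUNT can be sketched with a small example. The `movies` table and its rows are hypothetical, and SQLite (through Python's sqlite3 module) stands in for PostgreSQL so the snippet is runnable on its own.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", 400),
        ("Movie B", "Comedy", 200),
        ("Movie C", "Drama", 150),
        ("Movie D", None, 100),  # genre is NULL
        ("Movie E", None, 50),   # genre is NULL
    ],
)

# COUNT(*) counts every row; COUNT(genre) skips rows where genre is NULL.
all_rows, genre_values = conn.execute(
    "SELECT COUNT(*), COUNT(genre) FROM movies"
).fetchone()
print(all_rows, genre_values)
```

With two NULL genres among five rows, the two counts differ by two.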

Apart from the COUNT function, all the other aggregate functions are only used on one column at a time. Following this, let's look at the other functions!!

SUM()

The SUM function was used to illustrate aggregate functions visually in the image at the start of this article, so it's pretty clear that it adds up the values of a column. Unlike the COUNT function, the SUM function can only be used on columns with a numeric data type:

From line 1 in the code snippet above, we can see that the SUM function is applied to the downloads column (a numeric data type column).
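A minimal sketch of SUM on a numeric column, assuming a hypothetical `movies` table with made-up rows and using SQLite via Python's sqlite3 module in place of PostgreSQL:

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", 400),
        ("Movie B", "Action", 250),
        ("Movie C", "Comedy", 200),
    ],
)

# SUM collapses the whole downloads column into one value.
(total_downloads,) = conn.execute(
    "SELECT SUM(downloads) FROM movies"
).fetchone()
print(total_downloads)
```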

AVG()

The AVG function gets the mean of all values of a specified column. The mean of a set of numbers is the sum of all the numbers in that set divided by the number of values (count) in the set.
Like the SUM function, the AVG function can only be used on numeric columns:
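The definition of the mean can be checked against AVG directly. This is a sketch under assumptions: the `movies` table and rows are hypothetical, and SQLite (via Python's sqlite3 module) stands in for PostgreSQL. The `* 1.0` in the hand-computed version is there to avoid integer division.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", 100),
        ("Movie B", "Comedy", 200),
        ("Movie C", "Drama", 300),
    ],
)

# AVG(downloads) should match SUM(downloads) divided by COUNT(downloads).
average, mean_by_hand = conn.execute(
    """
    SELECT AVG(downloads),
           SUM(downloads) * 1.0 / COUNT(downloads)
    FROM movies
    """
).fetchone()
print(average, mean_by_hand)
```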

MIN() & MAX()

The MIN and MAX functions are two sides of the same coin: the MIN function gets the lowest value of a specified column and the MAX function gets the highest value of a specified column. Unlike the other two functions above, the MIN and MAX functions can be used on columns with numerical, date-time, and even character/string data types as seen below:
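A short sketch of MIN and MAX on a numeric column and on a text column, where the text comparison is alphabetical. The `movies` table and rows are hypothetical, and SQLite via Python's sqlite3 module stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie B", "Action", 250),
        ("Movie A", "Comedy", 400),
        ("Movie C", "Drama", 200),
    ],
)

# MIN/MAX work on numbers (lowest/highest) and on text (first/last alphabetically).
fewest, most, first_name, last_name = conn.execute(
    "SELECT MIN(downloads), MAX(downloads), MIN(name), MAX(name) FROM movies"
).fetchone()
print(fewest, most, first_name, last_name)
```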

Let's take a look at some helpful clauses - the AS, GROUP BY and ORDER BY clauses.
Take a look at this code below:

The result for the snippet of code above is hard to interpret without seeing the code that produced it.

Now look at this one:

This one is better, isn't it? 😉
The AS keyword is used to rename a column or table with an alias (which only exists for the duration of the query).
The result for the snippet is easier to understand with the AS keyword added in. This can be used for all sorts of queries to make your output easier to understand.
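The effect of an alias on the output column name can be sketched as below. The `movies` table, its rows, and the alias `total_downloads` are hypothetical, and SQLite via Python's sqlite3 module stands in for PostgreSQL. The name a database gives an unaliased aggregate column varies between systems, which is one more reason to set it explicitly.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", 400),
        ("Movie B", "Action", 250),
        ("Movie C", "Comedy", 200),
    ],
)

# AS gives the aggregated column a readable name in the result set.
cursor = conn.execute("SELECT SUM(downloads) AS total_downloads FROM movies")
column_name = cursor.description[0][0]  # name of the first output column
total = cursor.fetchone()[0]
print(column_name, total)
```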

There are cases when aggregate functions do not return a single value per column:

In line 3 above, the GROUP BY clause is introduced. It groups the SUM of the downloads according to the different genres.
The GROUP BY clause groups rows with the same values into summary rows. It is used on categorical columns.
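A minimal GROUP BY sketch, returning one SUM per genre rather than one for the whole table. The `movies` table and rows are hypothetical, and SQLite via Python's sqlite3 module stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", 400),
        ("Movie B", "Action", 250),
        ("Movie C", "Comedy", 200),
        ("Movie D", "Comedy", 200),
    ],
)

# One summary row per distinct genre.
totals_by_genre = dict(
    conn.execute(
        """
        SELECT genre, SUM(downloads) AS total_downloads
        FROM movies
        GROUP BY genre
        """
    ).fetchall()
)
print(totals_by_genre)
```

The rows are collected into a dict because, without ORDER BY, the order of the groups is not guaranteed.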
Now let's take a look at the ORDER BY clause:

The ORDER BY clause is introduced in line 4 above. It is used to sort the output of a query by one or more columns in either ascending (ASC) or descending (DESC) order.
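A sketch of ORDER BY applied to grouped totals, sorted from most to fewest downloads. The `movies` table, its rows, and the alias `total_downloads` are hypothetical, and SQLite via Python's sqlite3 module stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie E", "Drama", 150),
        ("Movie C", "Comedy", 200),
        ("Movie A", "Action", 400),
        ("Movie D", "Comedy", 200),
        ("Movie B", "Action", 250),
    ],
)

# Group first, then sort the summary rows by their total, highest first.
ordered_totals = conn.execute(
    """
    SELECT genre, SUM(downloads) AS total_downloads
    FROM movies
    GROUP BY genre
    ORDER BY total_downloads DESC
    """
).fetchall()
print(ordered_totals)
```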

HAVING

The HAVING clause is used as a conditional statement for aggregate functions and/or arithmetic. It is used with the GROUP BY clause to filter groups or aggregates based on one or more conditions.
It is very similar to the WHERE clause in that it filters/restricts the results of a query. However, the WHERE clause filters individual rows before they are grouped and cannot contain aggregate functions, while the HAVING clause filters the groups after aggregation. HAVING can only be used in a SELECT statement and is normally paired with the GROUP BY clause.

In this case, we will see how aggregate functions are used to filter a table using the HAVING clause:

In the snippet above, the GROUP BY clause returns the rows grouped according to the "genre" column and the HAVING clause specifies the condition to filter the groups.
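A sketch of HAVING filtering groups by an aggregate. The `movies` table, its rows, and the threshold of 300 are hypothetical, and SQLite via Python's sqlite3 module stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", 400),
        ("Movie B", "Action", 250),
        ("Movie C", "Comedy", 200),
        ("Movie D", "Comedy", 200),
        ("Movie E", "Drama", 150),
    ],
)

# Keep only the genres whose total downloads exceed 300 (arbitrary threshold).
popular_genres = conn.execute(
    """
    SELECT genre, SUM(downloads) AS total_downloads
    FROM movies
    GROUP BY genre
    HAVING SUM(downloads) > 300
    ORDER BY genre
    """
).fetchall()
print(popular_genres)
```

Drama totals 150, so its group is dropped. The same condition in a WHERE clause would be rejected because the aggregate is not yet computed at that stage.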

Now let's dive into using aggregate functions as window functions.

Aggregate Functions as Window Functions

Window functions are functions that perform operations across a set of rows that are related to the row the function is currently operating on. There are different window functions and they are used to simplify complex operations.
To understand the different window functions and how they are used in SQL, check out Window Functions in SQL.

In this article, we shall look at how aggregate functions work as window functions.
All the aggregate functions can be used as window functions and they each give awesome and unique results depending on what you are looking for.
Let's look at the SUM() as a window function that gives running totals:

The aggregate window function was used to get the running totals for the number of downloads per genre.

In line 1, all the columns were selected because aggregate window functions do not collapse the rows into a single value. They behave completely like window functions whilst retaining their computational qualities.
Line 2 is where the aggregate window function SUM() OVER() is introduced as a brand new column named "genre_running_total". This new column is a running total of all the downloads, split into partitions by genre and ordered by both the name of the movies and their genres.
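A running total per genre can be sketched as follows. The `movies` table and rows are hypothetical, the window is ordered by movie name only (an assumption made to keep the example small), and SQLite 3.25 or newer via Python's sqlite3 module stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", 400),
        ("Movie B", "Action", 250),
        ("Movie C", "Comedy", 200),
        ("Movie D", "Comedy", 200),
    ],
)

# SUM() OVER keeps every row and adds a running total that restarts per genre.
running_totals = conn.execute(
    """
    SELECT name, genre, downloads,
           SUM(downloads) OVER (PARTITION BY genre ORDER BY name)
               AS genre_running_total
    FROM movies
    ORDER BY genre, name
    """
).fetchall()
for row in running_totals:
    print(row)
```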

Using aggregate functions either on their own (SUM(), COUNT(), etc.), as a filter (with the HAVING clause), or as a window function (SUM() OVER()) gives different results.
They are very useful and make SQL coding and data presentation as well as analysis a lot easier.

I hope this has answered some of your questions and given you some new ideas!!
I'll be Back 😎
Bye for now.

Window Functions in SQL: Part 1

Onyinyechi Ofondu — Sun, 14 Nov 2021 16:56:40 +0000

In SQL, window functions are functions that perform operations across a set of rows that are related to the row the function is currently operating on.
Window functions were first introduced to the SQL standard in SQL:2003, with functionality expanded in later revisions. They are needed in SQL because they simplify certain complex operations and analyses, and can be used to calculate running totals, moving averages, and growth over time, amongst others.
In the dataset below, a window function operates on a particular column, and the result it produces spans an entire new column.

From the diagram above, we can see a sample dataset with 2 columns (column_1 & column_2). Using the LAG function (which shall be explained later), we can see the result that is produced, which spans an entire new column (lag). To get this, all that was needed using window functions was two lines of code, as seen below.

However, to perform this same operation without window functions in SQL we would need multiple self joins and subqueries.
For this article, we shall be using PostgreSQL with pgAdmin 4 as the GUI. It is one of the best graphical user interface platforms for PostgreSQL and is very beginner-friendly. You can download it here for your PostgreSQL needs.

To start with, we shall look at the basic window functions, which include:
The Ranking functions:- Row number, Rank & Dense rank
The Fetching functions:- Lag, Lead, First_Value & Last_Value.

I created a dataset of movie downloads which contains certain movie names, genres, and the number of downloads. This can be created using the SQL statement below.

Using the dataset above, we will look at the basic applications of window functions.

Ranking Functions. 🥇🥈🥉

Ranking functions are functions that assign numbers to rows in sequential order. Once a column in a dataset is ranked, things like the highest and lowest values can be seen at a glance, and the rank can be used as a reference (index) for other operations in SQL.
The different ranking functions produce similar results with a few differences:-
ROW_NUMBER numbers the rows sequentially starting from 1. It is used mainly as an index for a dataset and can be used for easier reference to each row.
RANK does the same as ROW_NUMBER above but assigns the same number to identical values and then skips as many of the following numbers as there were repeats.
DENSE_RANK also assigns the same number to identical values but doesn’t skip any numbers afterwards.
The movies dataset can be ranked as is, with the 1st movie recorded as the 1st rank, or, using the ORDER BY clause, it can be ranked from the lowest to the highest number of downloads (as shown below) or vice versa:

The OVER clause in lines 1,2 & 3 is a staple in all window functions and determines exactly how the rows of the query are split up for processing by the window function.
The ORDER BY clause in lines 2 & 3 is used inside the window functions' OVER clause, specifying that the order of the rows should be determined before the function is executed.

Looking at the output above, we can see the differences between the ROW_NUMBER(), the RANK() and the DENSE_RANK().
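The three ranking functions can be compared side by side on a column with a tie. The `movies` table and rows are hypothetical, all three functions are given the same ORDER BY so the output is deterministic (an assumption, not necessarily how the original snippet was written), and SQLite 3.25 or newer via Python's sqlite3 module stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration; two rows tie on 200.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", 400),
        ("Movie B", "Comedy", 200),
        ("Movie C", "Comedy", 200),
        ("Movie D", "Drama", 50),
    ],
)

# Rank from the lowest to the highest number of downloads.
rankings = conn.execute(
    """
    SELECT downloads,
           ROW_NUMBER() OVER (ORDER BY downloads) AS row_num,
           RANK()       OVER (ORDER BY downloads) AS rnk,
           DENSE_RANK() OVER (ORDER BY downloads) AS dense_rnk
    FROM movies
    ORDER BY row_num
    """
).fetchall()
for row in rankings:
    print(row)
```

For the tied rows, RANK repeats 2 and then jumps to 4, while DENSE_RANK repeats 2 and continues with 3.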

Let's take a look at the PARTITION BY clause below:

PARTITION BY, a new clause introduced in the snippet above, is used to divide the dataset into different partitions (sections). When this happens, any window function executed on the dataset treats each partition as its own table. We can see that inside the OVER clause, we have PARTITION BY genre and the ORDER BY clause. The contents of the OVER clause are applied 1st, which means that the rows are partitioned and ordered accordingly before the window function is computed.
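A sketch of ranking within partitions, where the rank restarts at 1 for each genre. The `movies` table, its rows, and the alias `genre_rank` are hypothetical, and SQLite 3.25 or newer via Python's sqlite3 module stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", 400),
        ("Movie B", "Action", 250),
        ("Movie C", "Comedy", 200),
        ("Movie D", "Comedy", 300),
    ],
)

# Rows are split by genre, sorted by downloads, then ranked inside each genre.
ranked_by_genre = conn.execute(
    """
    SELECT genre, name, downloads,
           RANK() OVER (PARTITION BY genre ORDER BY downloads DESC)
               AS genre_rank
    FROM movies
    ORDER BY genre, genre_rank
    """
).fetchall()
for row in ranked_by_genre:
    print(row)
```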

🥏🐕 Fetching Functions

The Fetching functions work a bit differently from the ranking functions:
LAG returns the value at n rows before the current row.
LEAD returns the value at n rows after the current row.

In line 1, we can see LAG(downloads, 1): this tells the LAG function to lag the downloads column by 1, so each row receives the downloads value from the row before it and the 1st row has no previous value (it pushes the values of the column down one row 😊).
The LEAD function does the same in the opposite direction (it pushes the values up), and since it's LEAD(downloads, 2) each row receives the value from 2 rows after it, as seen below.
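LAG and LEAD can be sketched together on a small table. The `movies` table and rows are hypothetical, the window is ordered by name (an assumption), and SQLite 3.25 or newer via Python's sqlite3 module stands in for PostgreSQL. Rows with no neighbour at the requested offset get NULL, which Python shows as None.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", 400),
        ("Movie B", "Action", 250),
        ("Movie C", "Comedy", 200),
        ("Movie D", "Drama", 150),
    ],
)

# LAG looks 1 row back, LEAD looks 2 rows ahead, both in name order.
shifted = conn.execute(
    """
    SELECT name, downloads,
           LAG(downloads, 1)  OVER (ORDER BY name) AS lag_1,
           LEAD(downloads, 2) OVER (ORDER BY name) AS lead_2
    FROM movies
    ORDER BY name
    """
).fetchall()
for row in shifted:
    print(row)
```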

The last two window functions in this article are pretty straightforward.
FIRST_VALUE returns the value of the first row in a table or partition and LAST_VALUE returns the value of the last row in a table or partition.

Nothing new in this snippet above. The FIRST_VALUE function is pretty easy to code but take a look at this one below:

LAST_VALUE is followed by RANGE BETWEEN. By default, when the OVER clause contains ORDER BY, a window function only reads from the beginning of the table/partition up to the current row, so it does not extend to the end of the table. LAST_VALUE needs the last row of the table/partition, so RANGE BETWEEN is used to extend the window frame to the end of the table.
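The effect of the frame on LAST_VALUE can be sketched by running it with and without an explicit frame. The `movies` table and rows are hypothetical, the frame shown (RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) is one common choice and is assumed here, and SQLite 3.25 or newer via Python's sqlite3 module stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", "Action", 400),
        ("Movie B", "Action", 250),
        ("Movie C", "Comedy", 200),
    ],
)

# With the default frame, LAST_VALUE stops at the current row.
# With the explicit frame, it sees through to the end of the table.
values = conn.execute(
    """
    SELECT name,
           FIRST_VALUE(name) OVER (ORDER BY downloads DESC) AS most_downloaded,
           LAST_VALUE(name)  OVER (ORDER BY downloads DESC) AS last_default,
           LAST_VALUE(name)  OVER (
               ORDER BY downloads DESC
               RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS least_downloaded
    FROM movies
    ORDER BY downloads DESC
    """
).fetchall()
for row in values:
    print(row)
```

In the third column each row simply gets its own name back, which is why the explicit frame is needed to reach the last row.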

Window functions make SQL life much easier and can be used in different ways. Now go grab yourself a dataset and get to work 🔥🔥