loading...
Cover image for Why you need SQL window functions (part 1)

Why you need SQL window functions (part 1)

helenanders26 profile image Helen Anderson Originally published at helenanderson.co.nz ・4 min read

Data Analysis (14 Part Series)

1) Speed up your queries with indexes 2) Resources for beginner data analysts 3 ... 12 3) SQL concepts from A to Z 4) 6 ways to search with System Tables 5) Pearl Jam, Nirvana and Beautiful Data Visualisations 6) Entity Relationship Diagrams explained by Sonic the Hedgehog 7) How to use the VLOOKUP Excel function 8) SQL Joins without the Venn diagrams 9) SQL aggregate functions and Mariah Carey 10) Why I use SQL CTEs 11) How to customise SQL Server Management Studio (SSMS) 12) Have you VACUUMed your tables lately? 13) Why you need SQL window functions (part 1) 14) Why you need SQL window functions (part 2)

Welcome to the world of window functions.

Window functions are closely related to aggregate functions. But rather than collapsing all the rows we want into one row, we keep them. Then we add a new column with a running total, rank or moving average. This becomes our ‘window frame.’

It becomes clearer with examples and pictures, so if you’re interested in using this powerful tool, keep reading.


1 – Introduction
2 – Syntax
3 – Setting the scene
4 – Create a running total
5 – RANK rows based on a given criteria


Introduction

Window functions come in three main types. These are:

Aggregate Window Functions

These use aggregate functions like SUM, COUNT, MAX, MIN over a set of rows and return a single result from the query.

Ranking Window Functions

These assign a ‘rank’ to a set of rows and using RANK, DENSE_RANK, ROW_NUMBER, NTILE

Value Window Functions

These use LAG, LEAD, FIRST_VALUE, LAST_VALUE to access a previous row without having to do a self-join.


Syntax

To get started with Window Functions lets begin with the syntax for an aggregate window function and how each part works.

You will need:

  • The function you want to perform: AVG, SUM, COUNT
  • An indication you want to use this Function over multiple rows: OVER
  • How you want to group your rows – PARTITION
  • How you want to order your rows – ORDER BY
  • To make sense of the syntax and show the use cases I’m going to tackle some real-life problems faced by a Sales Analyst.

Setting the scene

In this example, the Sales Manager has come to us with a request.

She is setting targets for the team for next year and needs historic data for her decision making. We have data available in the sales database but need it in a format that’s easier to use than the raw table.


Alt Text


How to Create a Running Total

First, she’d like to see daily sales totals but also wants to be able to see each orderid, so she can ‘drill down’ if need be.

We could tackle this using SUM to total up all the rows on the Orders table. But this collapses down the order details. A window function will allow us to see each order with a running total for the Sales Manager.

select
  sale_date,
  salesorderid,
  subtotal,
  sum(subtotal) over(partition by sale_date order by salesorderid) as total_sales
from 
  sales.salesorderheader
where 
  orderdate between 
'2018-01-01 00:00:00:000' and '2018-12-31 00:00:00:000'
order by sale_date

By using a window function, we can see each order for each day, and it’s total, with a running total along the side. Other aggregate functions work the same way, so you can use COUNT, AVG, MIN or MAX, or in combination.


Alt Text


How to RANK rows based on a given criteria

The Sales Manager is back.

She was happy with the table created for setting targets but now needs some strategies to increase sales. This time she wants to see sales for 2018 by the customer and dollar value. If she can spot when those big sales were in the past, maybe she could go knock on their door again?

To give her the data she needs we need to use a Window Function with a RANK.

select
  sale_date,
  salesorderid,
  subtotal,
  rank() over(order by subtotal desc) as sales_rank
from 
  sales.salesorderheader
where 
  orderdate between 
'2018-01-01 00:00:00:000' and '2018-12-31 00:00:00:000'
order by sale_date

Using RANK() in a window function ranks each row, in this case, by subtotal. The Sales Manager can now decide if she wants to target those customers again in the coming year.


Alt Text


If there are identical rows, like in the example above, they all receive the same rank and places 2 and 3 are skipped. To get around this we can use DENSE_RANK which would have kept rank 2 and 3 before moving on to fourth place.


There’s much more to learn and useful ways to use window functions, so check out part two in this series.


Read more


This post originally appeared on helenanderson.co.nz

Data Analysis (14 Part Series)

1) Speed up your queries with indexes 2) Resources for beginner data analysts 3 ... 12 3) SQL concepts from A to Z 4) 6 ways to search with System Tables 5) Pearl Jam, Nirvana and Beautiful Data Visualisations 6) Entity Relationship Diagrams explained by Sonic the Hedgehog 7) How to use the VLOOKUP Excel function 8) SQL Joins without the Venn diagrams 9) SQL aggregate functions and Mariah Carey 10) Why I use SQL CTEs 11) How to customise SQL Server Management Studio (SSMS) 12) Have you VACUUMed your tables lately? 13) Why you need SQL window functions (part 1) 14) Why you need SQL window functions (part 2)

Posted on May 26 by:

helenanders26 profile

Helen Anderson

@helenanders26

Making applications go faster at Raygun, AWS Data Hero, and tag moderator on Dev.to. Database concept you don’t understand? Let me know, I’ll write a post!

Discussion

markdown guide
 

You explicitly used

between '2018-01-01 00:00:00:000' and '2018-12-31 00:00:00:000'

in your query but it makes it error prone if your datetime isnt as accurate in milliseconds as you shown. You should better use for example:

(Your predicate) >= 2018-01-01 and < 2018-12-31

and this will guaranty that it will only take all the rows from the start and only the rows till the end.
So always use an open-ended range to prevent erroneously including or excluding rows. It’s much less complex to find the beginning of the next period than the end of the current period.

 

Oh, what would I do without window functions?

 

There's always more in SQL. I don't use it much these days, but whenever I look at your posts there's something I didn't know.