DEV Community

Cover image for The Ultimate Guide to Python Window Functions
StrataScratch
StrataScratch

Posted on • Originally published at stratascratch.com

The Ultimate Guide to Python Window Functions

This article focuses on different types of Python window functions, where and how to implement them, practice questions, reference articles and documentation.

Window function is a popular technique used to analyze a subset with related values. It is commonly used in SQL, however, these functions are extremely useful in Python as well.

If you would like to check out our content on SQL Window Functions, we have also created an article "The Ultimate Guide to SQL Window Functions" and a YouTube video!

This article discusses:

  • Different types of window functions

  • Where / How to implement these functions

  • Practice Questions

  • Reference articles / Documentation

A general format is written for each of these functions for you to understand and implement on your own. The format will include bold italicized text which indicate these are the sections of the function you need to replace during implementation.

For example:
dataframe.groupby(level='groupby_column').agg({‘aggregate_column’: ‘aggregate_function’})

Texts such as 'dataframe' and 'groupby_column' are bold and italicized meaning you should replace them with the actual variables.
Texts such as ‘.groupby’ and ‘level’ which are not bold and italicized are required to remain the same to execute the function.

Let’s suppose Amazon asks to find the total cost each user spent on their amazon orders.
An implementation of this function this dataset would look similar to this:
amazon_orders.groupby(level='user_id').agg({'cost': 'sum'})

Table of Contents

  • Python Window Functions overview diagram

  • Aggregate
    Group by
    Rolling
    Expanding

  • Ranking
    Row number

    reset_index()
    cumcount()

    Rank

    default_rank
    min_rank
    NA_bottom
    descending

    Dense rank
    Percent rank
    N-Tile / qcut()

  • Value
    Lag / Lead
    First / Last / nth value

Python Window Functions

Types of Python Window Functions

While there is not any official classification of Python window functions, these are the common functions implemented.

Aggregate

Aggregate python window functions

These are some common types of aggregate functions

  • Average
  • Max
  • Min
  • Sum
  • Count

Each of these aggregate functions (except count which will be explained later) can be used in three types of situations

  • Group by
  • Rolling
  • Expanding

Example

  • Group by: Facebook is trying to find the average revenue of Instagram for each year.
  • Rolling: Facebook is trying to find the rolling 3 year average revenue of Instagram
  • Expanding: Facebook is trying to find the cumulative average revenue of Instagram with an initial size of 2 years.

Group by

Group by aggregates is computing a certain column by a statistical function within each group. For example in a dataset

Let’s use a question from Amazon to explain this topic. This question is asking us to calculate the percentage of the total expenditure a customer spent on each order. Output the customer’s first name, order details (product name), and percentage of the order cost to their total spend across all orders.

Remember when approaching questions follow the 3 steps

  1. Ask clarifying questions
  2. State assumptions
  3. Attempt the question

When approaching these questions, understand which columns need to be grouped and which columns need to be aggregated.
For the Amazon example,
Group by: customer first_name, order_id, order_details
Aggregate: total_order_cost

In this question, there are 2 tables which need to be joined to get the customer’s first name, item, and spending. After merging both tables and filtering to get the required columns, to get the following dataset

Aggregate Group by Python window functions

Once necessary data is set in a single table, it is easier to manipulate.

Here we can find the total spending by person by grouping first_name and sum of total_order_cost

This is a general format on how to group by and aggregate the required columns.
dataframe.groupby(level='groupby_column').agg({'aggregate_column': 'aggregate_function'})

In reference to the Amazon example, this is the executing code.

total_spending = customer_orders.groupby("first_name").agg({'total_order_cost' : 'sum'})

This code will output the following dataframe

Output for Aggregate Python window function Question

After this, we want to add a column to the merged data frame to represent total spending by each person.

Let’s join both dataframes on the person’s first_name

pd.merge(merged_dataframe, total_spending, how="left", on="first_name")

Now we get the following dataset

Output 2 for Aggregate Python window function Question

As seen, the total_order_cost_y column represents the total spending per person and the total_order_cost_x to represent the cost per order. After this, this is a simple division of 2 columns to create the percentage of the spending column AND filtering the output to get the required columns.

result = df3[["first_name", "order_details", "percentage_total_cost"]]

Output 3 for Aggregate Python window function Question

However for certain situations, it is required to sort the values within each group. This is where the sort_values() function is implemented.

Referencing the amazon question example:

Suppose the interviewer asks to order the percentage_total_cost in descending order by person.

result = result.sort_values(by=['first_name', 'percentage_total_cost'], ascending = (True, False))

import pandas as pd
df = pd.merge(orders, customers, left_on="cust_id", right_on="id")
df1 = df[["first_name", "id_x", 'order_details', 'total_order_cost']]
df2 = df.groupby("first_name").agg({'total_order_cost' : 'sum'})
df3 = pd.merge(df1, df2, how="left", on="first_name")
df3["percentage_total_cost"] = df3["total_order_cost_x"] / df3["total_order_cost_y"]
result = df3[["first_name", "order_details", "percentage_total_cost"]]
result = result.sort_values(by=['first_name', 'percentage_total_cost'], ascending = (True, False))

Practice

Reference

Rolling vs Expanding Function

Before diving into how to execute a rolling or expanding function, let’s understand how each of these functions works. While rolling function and expanding function work similarly, there is a significant difference in the window size. Rolling function has a fixed window size, while the expanding function has a variable window size.

These images explain the difference between rolling and expanding functions.

Rolling Function

Rolling Function

Expanding Function

Expanding Function

Rolling and expanding functions both start with the same window size, but expanding function incorporates all the subsequent values beyond the initial window size.

Example: AccuWeather, a weather forecasting company, is trying to find the rolling and expanding average 10 day weather of San Francisco in January.

Rolling: Starting with a window size of 10, we take the average temperature from January 1st to January 10th. Next we take January 2nd to January 11th and so on. This shows the window size in rolling functions remains the same.

Expanding: Starting with a window size of 10, we take the average temperature from january 1st to January 10th. However, next we’ll take the average temperature from January 1st to January 11th. Then, January 1st to January 12th and so on. Therefore the window size has “expanded”.

While there are many aggregate functions that can be used in rolling/expanding functions, this article will discuss the frequently used functions (sum, average, max, min).

This brings us to the reason why the count function is not used in rolling and expanding functions. Count is used when a certain variable is grouped and there is a need to count the occurrence of a value. In the rolling and expanding function, there is no grouping of rows, but a calculation on a specific column.

Rolling Aggregate

Implementation of rolling functions are straightforward.

A general format:

DataSeries.rolling(window_size).aggregate_function()

Example:
Temperature of San Francisco of the first 22 days of 2021. Let’s find the average, sum, maximum, and minimum of a 5 day rolling time period.

weather['Average'] = weather['Temperature'].rolling(5).mean()
weather['Sum'] = weather['Temperature'].rolling(5).sum()
weather['Max'] = weather['Temperature'].rolling(5).max()
weather['Min'] = weather['Temperature'].rolling(5).min()

Output for Rolling Aggregate Python window function Question

After row 4, the rolling function over a fixed window size of 5 calculates the average, sum, max, and min over the Temperature values. This means that in the average column in row 16, calculates the average of rows 12, 13, 14, 15, and 16.

As expected, the first 4 values of the rolling function columns are null due to not having enough values to calculate. Sometimes you still want to calculate the aggregate of the first n rows even if it doesn’t fit the number of required rows.

In that case, we have to set a minimum number of observations to start calculating. Within the rolling function, you can specify the min_periods.

DataSeries.rolling(window_size, min_periods=minimum_observations).aggregate_function()

weather['Average'] = weather['Temperature'].rolling(5, min_periods=1).mean()
weather['Sum'] = weather['Temperature'].rolling(5, min_periods=2).sum()
weather['Max'] = weather['Temperature'].rolling(5, min_periods=3).max()
weather['Min'] = weather['Temperature'].rolling(5, min_periods=3).min()

Output 2 for Rolling Aggregate Python window function Question

Practice

Reference

Expanding Aggregate

Expanding function has a similar implementation to rolling functions.

DataSeries.expanding(minimum_observations).aggregate_function()

It is important to remember that unlike the rolling function, the expanding function does not set a window size, due to its variability. The minimum_observations is specified, so for rows less than the minimum_observations will be set as null.

Let’s use the same San Francisco Temperature example to explain expanding function

weather['Average'] = weather['Temperature'].expanding(5).mean()
weather['Sum'] = weather['Temperature'].expanding(5).sum()
weather['Max'] = weather['Temperature'].expanding(5).max()
weather['Min'] = weather['Temperature'].expanding(5).min()

Output for Expanding Aggregate Python window function Question

As it can be seen through the minimum temperature column, it takes the minimum value throughout the dataset, since it’s expanding beyond the minimum observations set.

import pandas as pd
weather = pd.DataFrame({'Temperature': [57, 54, 60, 54, 57, 58, 52, 52, 59, 54, 53, 57, 56, 60, 55, 58, 59, 64, 65, 66, 67, 74]})
weather['Rolling_Average'] = weather['Temperature'].rolling(5).mean()
weather['Rolling_Sum'] = weather['Temperature'].rolling(5).sum()
weather['Rolling_Max'] = weather['Temperature'].rolling(5).max()
weather['Rolling_Min'] = weather['Temperature'].rolling(5).min()
weather[‘Rolling_Average_minperiod’] = weather['Temperature'].rolling(5, min_periods=1).mean()
weather['Rolling_Sum_minperiod'] = weather['Temperature'].rolling(5, min_periods=2).sum()
weather['Rolling_Max_minperiod'] = weather['Temperature'].rolling(5, min_periods=3).max()
weather['Rolling_Min_minperiod'] = weather['Temperature'].rolling(5, min_periods=3).min()
weather['Expanding_Average'] = weather['Temperature'].expanding(5).mean()
weather['Expanding_Sum'] = weather['Temperature'].expanding(5).sum()
weather['Expanding_Max'] = weather['Temperature'].expanding(5).max()
weather['Expanding_Min'] = weather['Temperature'].expanding(5).min()
weather

Reference

Ranking

Ranking python window functions

Row Number

Counting the number of rows can be executed in 2 different situations each with a different function

  1. Across the entire dataframe - reset_index()
  2. Within groups - cumcount()

These are the equivalent functions as row_number() in SQL

Let’s use the following sample dataset to explain both concepts

Ranking Row Number

Reset_index

Within a dataframe, reset_index() will output the row number of each row.

General format to follow:
dataframe.reset_index()

Reset_index

To extract the nth row implement the .iloc() function

Dataframe.iloc[nth_row]

cumcount()

To calculate the row number within groups of a dataframe, you have to implement the cumcount() function in the following format

dataframe.groupby([‘column_names’]).cumcount()

Also remember to start the row count from 1 instead of the default 0, you need to add +1 to the cumcount() function

For the sample dataset, the implementation would be

df['Row_count'] = df.groupby(['c1', 'c2']).cumcount()+1

This would be the output

cumcount

Now that you have the row_count within each group, sometimes you have to extract a specific index row of each group.

For example, the company asks to extract the 2nd indexed value within each group. We can extract this by returning each row with a row_count value of 2.

Using iloc again, we can extract the subset with the following general format

Dataframe.loc[dataframe[column_name] == index]

For the column dataset above, we would use

df.loc[df['Row_count'] == 2]

to get the subset

Python Window Functions subset

import pandas as pd
df = pd.DataFrame({'c1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'], 'c2':['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X'], 'v1':[3, 5, 7, 1, 3, 1, 3, 1, 7, 4, 1, 6]})
df['Row_count'] = df.groupby(['c1', 'c2']).cumcount()+1
df.loc[df['Row_count'] == 2]

Questions:

Reference:

Rank

Ranking functions as the name states ranks values based on a certain variable. Ranking function works slightly differently than its SQL equivalent.

Rank() function can be executed with the following general format

dataframe[column_name].rank()

Let’s assume the following dataset from the Pandas ranking documentation.

Rank python window functions

And create 4 new columns which both use the rank() function to explain the function and its most popular parameters better

animal_legs['default_rank'] = animal_legs['Number_legs'].rank()
animal_legs['min_rank'] = animal_legs['Number_legs'].rank(method='min')
animal_legs['NA_bottom'] = animal_legs['Number_legs'].rank(method='min', na_option='bottom')
animal_legs['descending'] = animal_legs['Number_legs'].rank(method='min', ascending = False)

Python Window Functions

In this function we’re ranking the number_legs for each animal.

Let’s understand what each of the columns represents.

‘default_rank’

In a default rank() function, there are 3 important things to note.

  • Ascending order is assumed true
  • Null values are not ranked and left as null
  • If n values are equal, the rank split is averaged between the values.

The n values rank splitting is a bit confusing, so let’s dive more into this to explain it better.

In SQL for the dataset above, since both cat and dog both have 4 legs, it would assume both as rank = 2 and spider with the next highest number of legs would have a rank of 4.

Instead of that, Pandas averages out the ‘would have been’ ranks between cat and dog.
There should be a rank of 2 and 3, but since cat and dog have the same value, the rank is the average of 2 and 3, which is 2.5

Let’s alter the animal's example to include ‘donkey’ which has 4 legs.

default_rank python window function

Penguin has the least number of legs with 2, so it has a rank = 1.

Since cat, dog, and donkey all have the next highest count of 4 legs, it will take the average of 2,3,4, due to 3 animals with the same value.

If we have 4 animals all with 4 legs, it will take the average of 2,3,4,5 = 3.5

‘min_rank’

When setting the parameter method=’min, instead of taking the average ranked value, it will take the minimum rank between equal values.

The minimum rank is the same as how the rank function in SQL works.

Using the animals example, the rank between dog and cat will now be 2 instead of 2.5.
And for the example with donkey, it will still assume a rank of 2, while spider will be set to a rank of 5.

min_rank python window function

‘NA_bottom’

Certain rows contain null values and under default conditions, the rank will also be set as null. In certain cases you would want the null values to rank the lowest or highest.

Setting the na_option as bottom would give the highest ranked value and setting as top would give it the lowest ranked value

In the animals example, we set null values as bottom and rank method as minimum

NA_bottom python window function

‘descending’

If you want to set the rank in descending order, set the parameter ascending as false.

Referring to the animals example, we set ascending to false and method as minimum

descending python window function

Questions:

Reference:

Dense Rank

Dense rank is similar to a normal rank with a slight difference.

During a normal rank function, ranking numbers may be skipped, while dense_rank doesn’t skip.

Dense Rank Python Window Function

For example in the animals dataframe, after [dog, cat, donkey], spider was the next value. In minimum rank, it sets spiders as rank = 5, since 2,3,4 are technically set for cat,dog, and donkey.

In a dense rank, it will set the immediate consecutive ranks as seen above. Instead of 5th rank, spider was set to 3rd rank in dense_rank.

Fortunately, you just have to edit the method parameter in a rank function to get the dense rank

animal_legs['dense_rank'] = animal_legs['Number_legs'].rank(method='dense')

All the other parameters, such as na_option and ascending, can also be set alongside the dense method as mentioned before.

Questions:

Reference:

Percent rank (Percentile)

Percent rank python window function

Percent rank python window function
Percent rank is just a representation of the ranks compared to the highest rank.

As seen in the animals dataframe above, spider has a rank of 5 for both default_rank and min_rank. Since 5 is the highest rank, the other values would be compared to this.
For cat in default_rank, it has a value of 3, and 3 / 5 = 0.6 for default_pct_rank
For cat in min_rank, it has a value of 2, and 2 / 5 = 0.4 for min_pct_rank

Percentage rank is boolean parameter which can be set

animal_legs['min_pct_rank'] = animal_legs['Number_legs'].rank(method='min', pct=True)

Questions:

Reference:

import pandas as pd
import numpy as np
animal_legs = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog', 'spider', 'snake', 'donkey'], 'Number_legs': [4, 2, 4, 8, np.nan, 4]})
animal_legs['default_rank'] = animal_legs['Number_legs'].rank()
animal_legs['min_rank'] = animal_legs['Number_legs'].rank(method='min')
animal_legs['NA_bottom'] = animal_legs['Number_legs'].rank(method='min', na_option='bottom')
animal_legs['descending'] = animal_legs['Number_legs'].rank(method='min', ascending = False)
animal_legs['dense_rank'] = animal_legs['Number_legs'].rank(method='dense')
animal_legs['default_pct_rank'] = animal_legs['Number_legs'].rank(pct=True)
animal_legs['min_pct_rank'] = animal_legs['Number_legs'].rank(method='min', pct=True)
animal_legs

N-Tile / qcut()

qcut() is not a popular function, since ranking based on quantiles beyond percentiles are not as common. While it isn’t as popular, it is still an extremely powerful function!
If you don’t know the relationship between quantiles and percentiles, check out this article by statology!

Let’s take a question from DoorDash, to explain how qcut is used.

The question asks us to find the bottom 2% of the dataset, which is the first quantile from a 50-quantile split.

A general format to follow when using qcut():
pd.qcut(dataseries, q=number_quantiles, labels = range(lower_bound, upper_bound))

N-Tile python window function

This is a subset of the dataset which we will use to analyze the usage of the qcut() function.

  • dataseries → The column to analyze, which is total_order in this example
  • number_quantiles → Number of quantiles to split by, which is 50 due to 50-quantile split
  • labels → Range of ntiles, which 1-50 in this case. However, the upper bound is calculated as n-1. So if we set a range of 1-50, the highest ntile will be 49 instead of 50. Due to this, we set our upper bound as n+1, which in this example would be range(1,51)

For this example, this would be the following code.

result[‘ntile’] = pd.qcut(result['total_order'],q=50, labels=range(1, 50))

N-Tile python window function example

As seen in the example, ‘ntile’ has been split and represents the quantile.

It must also be noted that if the label range is not specified, the quantile range is returned.

For example executing the same code above without the labels range:

result[‘ntile_range’] = pd.qcut(result['total_order'],q=50)

N-Tile python window function example 2

import pandas as pd
result = doordash_delivery[doordash_delivery['customer_placed_order_datetime'].between('2020-05-01', '2020-05-31')].groupby("restaurant_id")["order_total"].sum().to_frame('total_order').reset_index()
result['ntile'] = pd.qcut(result['total_order'],q=50, labels=range(1, 50), duplicates = 'drop').values.tolist()
result['ntile_range'] = pd.qcut(result['total_order'],q=50, duplicates = 'drop').values.tolist()
result

Questions:

Reference:

Value

Lag / Lead

Lag and Lead functions are used to represent another column but are shifted by a single or multiple rows.

Let’s use a dataset given by Facebook (Meta) which represents the total cost of orders by each month.

Value Lag Lead python window function

In the ‘Lag’ column, we can see that values were shifted down by one. 305 which is the total_order_cost for January, appears in the ‘Lag’ column but on the same row as February.

In the ‘Lead’ column, the opposite occurs. Rows are shifted up by one, so 285 which is the total_order_cost for February appears in the ‘Lead’ column in January.

This makes it easier to calculate comparing values side by side such as growth of sales by month.

A general format to follow:
dataframe[‘shifting_column’].shift(number_shift)

Code used for the data:

orders['Lag'] = orders['total_order_cost'].shift(1)
orders['Lead'] = orders['total_order_cost'].shift(-1)

Also another key point to remember is the null values that are present due to the shift. As seen there are 1 null values (NaN) in the Lag and Lead column, since the values have been shifted by 1. There will be n rows of null values, due to the data series being shifted by n rows. So for the first n rows of ‘Lag’ column and last n rows of ‘Lead’ column will be null values.
If you want to replace the null values that are generated by the shift, you can use the fill_value parameter.

We execute the code with updated parameters

orders['Lag'] = orders['total_order_cost'].shift(1, fill_value = 0)
orders['Lead'] = orders['total_order_cost'].shift(-1, fill_value = 0)

To get this as the output

Lag Lead window function

import pandas as pd
import numpy as np
orders['order_date'] = orders['order_date'].apply(pd.to_datetime)
orders['order_month'] = orders['order_date'].dt.month
orders.loc[(orders.order_month == 1),'order_month'] = 'January'
orders.loc[(orders.order_month == 2),'order_month'] = 'February'
orders.loc[(orders.order_month == 3),'order_month'] = 'March'
orders.loc[(orders.order_month == 4),'order_month'] = 'April'
orders['order_month'] = pd.Categorical(orders['order_month'], ["January", "February", "March", "April"])
orders = orders[['order_month', 'total_order_cost']]
orders = orders.sort_values(by=['order_month'])
orders = orders.groupby("order_month").agg({'total_order_cost' : 'sum'})
orders['Lag'] = orders['total_order_cost'].shift(1, fill_value = 0)
orders['Lead'] = orders['total_order_cost'].shift(-1, fill_value = 0)
orders

Questions:

Reference:

First/Last/nth value

Finding the nth value (including first and last) within groups of a dataset is fairly simple with Python as well.

Let’s use the same orders dataset by Facebook used in the Lag/Lead section.

First Last and nth value python window functions

As seen here, the order_date has been ordered from earliest to latest.

Let’s find the first order of each month using the nth() function

General format:
dataframe.groupby(‘groupby_column’).nth(nth_value)

nth_value represents the indexed value
nth_value in the nth() function works the same way as extracting the nth_value in a list.
0 represents the first value
-1 represents the last value

Using the following code:
orders.groupby('order_month').nth(0)

First Last and nth value python window functions output

To return only a specific column, such as total_order_cost, you can specify this as well.

orders.groupby('order_month').nth(0)['total_order_cost'].reset_index()

First Last and nth value python window functions output 2

Now if you want to join the nth value to the respective grouped by columns in the original dataframe, you could use a merge function, similar to how the merge function was applied in the aggregate functions mentioned above. Make sure you remember to extract the column to merge on as well! In this example, it would be the ‘order_month’ index column, where you should use the reset_index() function.

import pandas as pd
orders['order_date'] = orders['order_date'].apply(pd.to_datetime)
orders['order_month'] = orders['order_date'].dt.month
orders = orders.sort_values(by=['order_date'])
ordered_group = orders.groupby('order_month').nth(0)['total_order_cost'].reset_index()

Reference:

Practice Questions Compiled

Aggregate
Group by
Rolling
Expanding
  • [[[ No Questions ]]]
Ranking
Row_number()
rank()
dense_rank()
percent_rank()
ntile() / qcut()
Value
Lag/Lead
First / Last / nth_value()
  • [[[ No Questions ]]]

Top comments (0)