You may think SQL window functions are similar to aggregating but it actually defines a window of rows with a given length around the current row.
You can do,
- aggregate operations: SUM, COUNT, MAX, MIN
- ranking: RANK, DENSE_RANK, ROW_NUMBER, NTILE
- value: LAG, LEAD, FIRST_VALUE, LAST_VALUE
Let's give a concrete example to understand what exactly a window function is doing.
Consider you are a data scientist working on a classification task and you want to predict if an event will occur within the next three hours. In your data, the label is 1 if the event occurred at that exact time and 0 otherwise. Therefore you need a backfill operation to build the target. You can do it by using SQL window functions. Let's create the data frame first,
We don't need to create a database, we will use the pandasql
library for executing SQL queries over pandas
data frames (download via pip install pandas
and pip install pandasql
).
# import libraries
import pandas as pd
data = [
("2020-01-01 06:00:00", 0),
("2020-01-01 07:00:00", 0),
("2020-01-01 08:00:00", 0),
("2020-01-01 09:00:00", 0),
("2020-01-01 10:00:00", 0),
("2020-01-01 11:00:00", 1),
("2020-01-01 12:00:00", 0),
("2020-01-01 13:00:00", 0),
("2020-01-01 14:00:00", 0),
("2020-01-01 15:00:00", 0),
]
# create data frame
df = pd.DataFrame(data, columns=["datetime", "event"])
df.head(10)
Here is the dataframe:
Let's solve the problem:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
solution = pysqldf(
"""
SELECT datetime, event, MAX(event) OVER
(
ORDER BY datetime ASC
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING
) as target
FROM df
"""
)
solution.head(10)
We ordered the rows using datetime
column in ascending order and filled the window rows with the MAX value. This is the data frame we get with the target column:
We labeled the previous 3 rows before 11am in the target
column.
Top comments (0)