You are going to hit a problem.
Not now. A few weeks into data analysis work, or maybe during a job interview, someone will ask you something like:
"For each sale, can you show the sale amount alongside the running total for that salesperson?"
Or: "Rank employees by salary within each department without removing any rows."
Or: "For each order, show the previous order's amount for comparison."
You will try GROUP BY. It collapses rows. Wrong.
You will try joins. Awkward, multiple CTEs. Messy.
You will try Python. Works, but you had to pull all the data out first.
Then someone tells you about window functions and you feel both relieved and slightly cheated that nobody mentioned these sooner.
This is that mention.
The Core Idea in One Sentence
Window functions compute a value for each row using other rows, without collapsing the result into groups.
GROUP BY: 8 rows go in, 3 rows come out.
Window function: 8 rows go in, 8 rows come out. Each row now has a new computed value.
That distinction is everything.
Setup
import sqlite3
import pandas as pd
conn = sqlite3.connect("window_demo.db")
conn.executescript("""
DROP TABLE IF EXISTS sales;
CREATE TABLE sales (
sale_id INTEGER PRIMARY KEY,
rep_name TEXT,
region TEXT,
amount REAL,
sale_date TEXT,
product TEXT
);
INSERT INTO sales VALUES
(1, 'Alex', 'North', 45000, '2024-01-05', 'Laptop'),
(2, 'Priya', 'South', 32000, '2024-01-08', 'Phone'),
(3, 'Alex', 'North', 12000, '2024-01-12', 'Watch'),
(4, 'Sam', 'East', 28000, '2024-01-15', 'Tablet'),
(5, 'Priya', 'South', 75000, '2024-01-18', 'Laptop'),
(6, 'Alex', 'North', 8500, '2024-01-22', 'Headphones'),
(7, 'Sam', 'East', 45000, '2024-02-01', 'Laptop'),
(8, 'Priya', 'South', 18000, '2024-02-05', 'Watch'),
(9, 'Alex', 'North', 62000, '2024-02-10', 'Laptop'),
(10, 'Sam', 'East', 15000, '2024-02-14', 'Headphones'),
(11, 'Priya', 'South', 41000, '2024-02-18', 'Tablet'),
(12, 'Alex', 'North', 33000, '2024-02-22', 'Phone');
""")
conn.commit()
def q(sql):
return pd.read_sql_query(sql, conn)
The Syntax First
Every window function uses the OVER() clause. That is what makes it a window function instead of a regular aggregate.
function_name() OVER (
PARTITION BY column -- optional: like GROUP BY for the window
ORDER BY column -- optional: defines row order within window
ROWS/RANGE BETWEEN ... -- optional: defines which rows to include
)
Without PARTITION BY: the window is the entire table.
With PARTITION BY: separate windows per group. Like GROUP BY but rows stay.
Aggregate Functions as Window Functions
Every aggregate you know (SUM, AVG, COUNT, MIN, MAX) can become a window function by adding OVER.
result = q("""
SELECT
rep_name,
sale_date,
amount,
SUM(amount) OVER () AS company_total,
AVG(amount) OVER () AS company_avg,
SUM(amount) OVER (PARTITION BY rep_name) AS rep_total,
AVG(amount) OVER (PARTITION BY rep_name) AS rep_avg,
ROUND(100.0 * amount /
SUM(amount) OVER (PARTITION BY rep_name), 1) AS pct_of_rep_total
FROM sales
ORDER BY rep_name, sale_date;
""")
print(result.round(1))
Output:
rep_name sale_date amount company_total company_avg rep_total rep_avg pct_of_rep_total
0 Alex 2024-01-05 45000 414500.0 34541.7 160500.0 40125.0 28.0
1 Alex 2024-01-12 12000 414500.0 34541.7 160500.0 40125.0 7.5
2 Alex 2024-01-22 8500 414500.0 34541.7 160500.0 40125.0 5.3
3 Alex 2024-02-10 62000 414500.0 34541.7 160500.0 40125.0 38.6
4 Alex 2024-02-22 33000 414500.0 34541.7 160500.0 40125.0 20.6
5 Priya 2024-01-08 32000 414500.0 34541.7 166000.0 41500.0 19.3
...
Every single row is still there. No collapsing. But each row now shows:
- The company-wide total alongside the individual sale
- Each rep's total alongside each individual sale
- What percentage of that rep's revenue this one sale represents
This is impossible with GROUP BY. With window functions, one query.
Running Totals with ORDER BY
Add ORDER BY inside OVER() and the window becomes cumulative.
result = q("""
SELECT
rep_name,
sale_date,
amount,
SUM(amount) OVER (
PARTITION BY rep_name
ORDER BY sale_date
) AS running_total
FROM sales
ORDER BY rep_name, sale_date;
""")
print(result)
Output:
rep_name sale_date amount running_total
0 Alex 2024-01-05 45000 45000.0
1 Alex 2024-01-12 12000 57000.0
2 Alex 2024-01-22 8500 65500.0
3 Alex 2024-02-10 62000 127500.0
4 Alex 2024-02-22 33000 160500.0
5 Priya 2024-01-08 32000 32000.0
6 Priya 2024-01-18 75000 107000.0
...
The running total accumulates within each rep's partition in chronological order. Alex starts at 45000, adds 12000 to get 57000, and so on. Priya's running total starts fresh at 32000 because it is a separate partition.
ROW_NUMBER, RANK, DENSE_RANK
Three ranking functions. They look similar and produce different results when there are ties.
result = q("""
SELECT
rep_name,
amount,
ROW_NUMBER() OVER (ORDER BY amount DESC) AS row_num,
RANK() OVER (ORDER BY amount DESC) AS rank_val,
DENSE_RANK() OVER (ORDER BY amount DESC) AS dense_rank_val
FROM sales
ORDER BY amount DESC;
""")
print(result)
Output:
rep_name amount row_num rank_val dense_rank_val
0 Priya 75000 1 1 1
1 Alex 62000 2 2 2
2 Alex 45000 3 3 3
3 Sam 45000 4 3 3
4 Priya 41000 5 5 4
5 Sam 28000 6 6 5
...
Alex and Sam both have 45000. Watch what happens:
ROW_NUMBER: assigns 3 and 4. No ties, just sequential numbering. Arbitrary which gets 3.
RANK: assigns 3 and 3, then skips to 5. The skip represents the gap where 4 would have been.
DENSE_RANK: assigns 3 and 3, then continues with 4. No gaps.
When to use which: ROW_NUMBER when you want exactly N rows with no duplicates (top 3 strictly). RANK for sports-style rankings where ties share a rank and the next rank skips. DENSE_RANK when you want tied entries to share a rank without creating gaps downstream.
Ranking Within Groups
The real power: rank within partitions.
result = q("""
SELECT
rep_name,
region,
amount,
sale_date,
RANK() OVER (
PARTITION BY rep_name
ORDER BY amount DESC
) AS rank_within_rep,
RANK() OVER (
PARTITION BY region
ORDER BY amount DESC
) AS rank_within_region
FROM sales
ORDER BY rep_name, rank_within_rep;
""")
print(result)
Output:
rep_name region amount sale_date rank_within_rep rank_within_region
0 Alex North 62000 2024-02-10 1 1
1 Alex North 45000 2024-01-05 2 2
2 Alex North 33000 2024-02-22 3 3
3 Alex North 12000 2024-01-12 4 4
4 Alex North 8500 2024-01-22 5 5
5 Priya South 75000 2024-01-18 1 1
...
Each rep ranked within their own results. Each rep also ranked within their region (which happens to be the same here since each rep owns one region). In a real dataset with multiple reps per region, these would differ.
LAG and LEAD: Looking at Adjacent Rows
LAG looks at the previous row. LEAD looks at the next row. Both without any joins.
result = q("""
SELECT
rep_name,
sale_date,
amount,
LAG(amount) OVER (PARTITION BY rep_name ORDER BY sale_date) AS prev_sale,
LEAD(amount) OVER (PARTITION BY rep_name ORDER BY sale_date) AS next_sale,
amount - LAG(amount) OVER (
PARTITION BY rep_name ORDER BY sale_date
) AS change_from_prev
FROM sales
ORDER BY rep_name, sale_date;
""")
print(result)
Output:
rep_name sale_date amount prev_sale next_sale change_from_prev
0 Alex 2024-01-05 45000 NaN 12000.0 NaN
1 Alex 2024-01-12 12000 45000.0 8500.0 -33000.0
2 Alex 2024-01-22 8500 12000.0 62000.0 -3500.0
3 Alex 2024-02-10 62000 8500.0 33000.0 53500.0
4 Alex 2024-02-22 33000 62000.0 NaN -29000.0
The first row has no previous sale so prev_sale is NULL. The last row has no next sale so next_sale is NULL. Every other row shows both neighbors.
change_from_prev shows whether each sale was up or down from the previous one. This is month-over-month or sale-by-sale comparison without any self-joins or CTEs.
LAG(amount, 2) looks two rows back. LEAD(amount, 1, 0) looks one row ahead and returns 0 instead of NULL when there is no next row. The default fill value is the third argument.
NTILE: Bucketing Into Groups
NTILE(n) divides rows into n equal buckets and assigns each row a bucket number.
result = q("""
SELECT
rep_name,
amount,
NTILE(4) OVER (ORDER BY amount DESC) AS quartile
FROM sales
ORDER BY amount DESC;
""")
print(result)
Output:
rep_name amount quartile
0 Priya 75000 1
1 Alex 62000 1
2 Alex 45000 1
3 Sam 45000 2
4 Priya 41000 2
5 Sam 28000 2
6 Priya 32000 3
...
Top 25% of sales are in quartile 1. This is how you create performance tiers, risk buckets, or customer segments by value without hardcoding thresholds.
FIRST_VALUE and LAST_VALUE
Get the first or last value in a window.
result = q("""
SELECT
rep_name,
sale_date,
amount,
FIRST_VALUE(amount) OVER (
PARTITION BY rep_name
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_sale,
LAST_VALUE(amount) OVER (
PARTITION BY rep_name
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS latest_sale
FROM sales
ORDER BY rep_name, sale_date;
""")
print(result)
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING means the window includes all rows in the partition from first to last. Without this, LAST_VALUE defaults to the current row which is never useful.
A Complete Real-World Example
Sales performance dashboard in one query.
result = q("""
WITH ranked AS (
SELECT
rep_name,
sale_date,
amount,
SUM(amount) OVER (PARTITION BY rep_name ORDER BY sale_date)
AS running_total,
AVG(amount) OVER (PARTITION BY rep_name)
AS rep_avg,
RANK() OVER (ORDER BY amount DESC)
AS overall_rank,
RANK() OVER (PARTITION BY rep_name ORDER BY amount DESC)
AS personal_best_rank,
LAG(amount) OVER (PARTITION BY rep_name ORDER BY sale_date)
AS prev_amount,
amount - LAG(amount) OVER (
PARTITION BY rep_name ORDER BY sale_date)
AS mom_change
FROM sales
)
SELECT
rep_name,
sale_date,
amount,
ROUND(running_total, 0) AS cumulative,
ROUND(rep_avg, 0) AS rep_average,
overall_rank,
personal_best_rank,
COALESCE(ROUND(mom_change, 0), 0) AS change_vs_prev
FROM ranked
ORDER BY rep_name, sale_date;
""")
print(result)
Running total per rep. Company-wide rank per sale. Whether each sale beat the rep's previous sale. All in one query. No joins. No Python processing.
A Resource Worth Knowing
Use the Index, Luke at use-the-index-luke.com has a section on window functions that explains not just the syntax but why the database executes them the way it does. Written by Markus Winand, it is the most practically useful resource for understanding SQL performance and window functions together. Search "Use the Index Luke window functions."
Mode Analytics SQL tutorial also has an excellent interactive window functions section at mode.com/sql-tutorial/sql-window-functions that lets you run every query in this post directly in your browser. Zero setup. Immediately useful.
Try This
Create window_functions_practice.py.
Use the sales database from this post.
Write a query that shows each sale with the rep's running total, the rep's all-time best sale (use MAX as a window function), and how far this sale is from their personal best.
Find the top 2 sales per region using RANK and a CTE to filter. Only show rank 1 and rank 2 within each region.
Calculate month-over-month growth rate for the total company revenue. January is the baseline. Show February as a percentage change from January.
Using NTILE(3), bucket all sales into three tiers: top third, middle third, bottom third. Show how many sales are in each tier and the total revenue per tier.
For each rep, show their first sale date, latest sale date, first sale amount, and latest sale amount in a single query using FIRST_VALUE and LAST_VALUE.
What's Next
One post left in Phase 4: connecting SQL to Python so your data pipelines can use both languages seamlessly. Then Phase 5: Git, GitHub, Jupyter, Colab. The tools that organize your work and make it shareable. Then the real thing starts. Machine learning.
Top comments (0)