DEV Community

Cover image for Window Functions: SQL's Most Powerful Feature Nobody Uses
Akhilesh
Akhilesh

Posted on

Window Functions: SQL's Most Powerful Feature Nobody Uses

You are going to hit a problem.

Not now. A few weeks into data analysis work, or maybe during a job interview, someone will ask you something like:

"For each sale, can you show the sale amount alongside the running total for that salesperson?"

Or: "Rank employees by salary within each department without removing any rows."

Or: "For each order, show the previous order's amount for comparison."

You will try GROUP BY. It collapses rows. Wrong.
You will try joins. Awkward, multiple CTEs. Messy.
You will try Python. Works, but you had to pull all the data out first.

Then someone tells you about window functions and you feel both relieved and slightly cheated that nobody mentioned these sooner.

This is that mention.


The Core Idea in One Sentence

Window functions compute a value for each row using other rows, without collapsing the result into groups.

GROUP BY: 8 rows go in, 3 rows come out.
Window function: 8 rows go in, 8 rows come out. Each row now has a new computed value.

That distinction is everything.


Setup

import sqlite3
import pandas as pd

conn = sqlite3.connect("window_demo.db")

conn.executescript("""
    DROP TABLE IF EXISTS sales;

    CREATE TABLE sales (
        sale_id      INTEGER PRIMARY KEY,
        rep_name     TEXT,
        region       TEXT,
        amount       REAL,
        sale_date    TEXT,
        product      TEXT
    );

    INSERT INTO sales VALUES
        (1,  'Alex',   'North', 45000, '2024-01-05', 'Laptop'),
        (2,  'Priya',  'South', 32000, '2024-01-08', 'Phone'),
        (3,  'Alex',   'North', 12000, '2024-01-12', 'Watch'),
        (4,  'Sam',    'East',  28000, '2024-01-15', 'Tablet'),
        (5,  'Priya',  'South', 75000, '2024-01-18', 'Laptop'),
        (6,  'Alex',   'North',  8500, '2024-01-22', 'Headphones'),
        (7,  'Sam',    'East',  45000, '2024-02-01', 'Laptop'),
        (8,  'Priya',  'South', 18000, '2024-02-05', 'Watch'),
        (9,  'Alex',   'North', 62000, '2024-02-10', 'Laptop'),
        (10, 'Sam',    'East',  15000, '2024-02-14', 'Headphones'),
        (11, 'Priya',  'South', 41000, '2024-02-18', 'Tablet'),
        (12, 'Alex',   'North', 33000, '2024-02-22', 'Phone');
""")
conn.commit()

def q(sql):
    return pd.read_sql_query(sql, conn)
Enter fullscreen mode Exit fullscreen mode

The Syntax First

Every window function uses the OVER() clause. That is what makes it a window function instead of a regular aggregate.

function_name() OVER (
    PARTITION BY column    -- optional: like GROUP BY for the window
    ORDER BY column        -- optional: defines row order within window
    ROWS/RANGE BETWEEN ... -- optional: defines which rows to include
)
Enter fullscreen mode Exit fullscreen mode

Without PARTITION BY: the window is the entire table.
With PARTITION BY: separate windows per group. Like GROUP BY but rows stay.


Aggregate Functions as Window Functions

Every aggregate you know (SUM, AVG, COUNT, MIN, MAX) can become a window function by adding OVER.

result = q("""
    SELECT
        rep_name,
        sale_date,
        amount,
        SUM(amount)   OVER ()                       AS company_total,
        AVG(amount)   OVER ()                       AS company_avg,
        SUM(amount)   OVER (PARTITION BY rep_name)  AS rep_total,
        AVG(amount)   OVER (PARTITION BY rep_name)  AS rep_avg,
        ROUND(100.0 * amount /
              SUM(amount) OVER (PARTITION BY rep_name), 1) AS pct_of_rep_total
    FROM sales
    ORDER BY rep_name, sale_date;
""")
print(result.round(1))
Enter fullscreen mode Exit fullscreen mode

Output:

  rep_name   sale_date  amount  company_total  company_avg  rep_total  rep_avg  pct_of_rep_total
0     Alex  2024-01-05   45000       414500.0      34541.7   160500.0  40125.0              28.0
1     Alex  2024-01-12   12000       414500.0      34541.7   160500.0  40125.0               7.5
2     Alex  2024-01-22    8500       414500.0      34541.7   160500.0  40125.0               5.3
3     Alex  2024-02-10   62000       414500.0      34541.7   160500.0  40125.0              38.6
4     Alex  2024-02-22   33000       414500.0      34541.7   160500.0  40125.0              20.6
5    Priya  2024-01-08   32000       414500.0      34541.7   166000.0  41500.0              19.3
...
Enter fullscreen mode Exit fullscreen mode

Every single row is still there. No collapsing. But each row now shows:

  • The company-wide total alongside the individual sale
  • Each rep's total alongside each individual sale
  • What percentage of that rep's revenue this one sale represents

This is impossible with GROUP BY. With window functions, one query.


Running Totals with ORDER BY

Add ORDER BY inside OVER() and the window becomes cumulative.

result = q("""
    SELECT
        rep_name,
        sale_date,
        amount,
        SUM(amount) OVER (
            PARTITION BY rep_name
            ORDER BY sale_date
        ) AS running_total
    FROM sales
    ORDER BY rep_name, sale_date;
""")
print(result)
Enter fullscreen mode Exit fullscreen mode

Output:

  rep_name   sale_date  amount  running_total
0     Alex  2024-01-05   45000        45000.0
1     Alex  2024-01-12   12000        57000.0
2     Alex  2024-01-22    8500        65500.0
3     Alex  2024-02-10   62000       127500.0
4     Alex  2024-02-22   33000       160500.0
5    Priya  2024-01-08   32000        32000.0
6    Priya  2024-01-18   75000       107000.0
...
Enter fullscreen mode Exit fullscreen mode

The running total accumulates within each rep's partition in chronological order. Alex starts at 45000, adds 12000 to get 57000, and so on. Priya's running total starts fresh at 32000 because it is a separate partition.


ROW_NUMBER, RANK, DENSE_RANK

Three ranking functions. They look similar and produce different results when there are ties.

result = q("""
    SELECT
        rep_name,
        amount,
        ROW_NUMBER()  OVER (ORDER BY amount DESC) AS row_num,
        RANK()        OVER (ORDER BY amount DESC) AS rank_val,
        DENSE_RANK()  OVER (ORDER BY amount DESC) AS dense_rank_val
    FROM sales
    ORDER BY amount DESC;
""")
print(result)
Enter fullscreen mode Exit fullscreen mode

Output:

  rep_name  amount  row_num  rank_val  dense_rank_val
0    Priya   75000        1         1               1
1     Alex   62000        2         2               2
2     Alex   45000        3         3               3
3      Sam   45000        4         3               3
4    Priya   41000        5         5               4
5     Sam    28000        6         6               5
...
Enter fullscreen mode Exit fullscreen mode

Alex and Sam both have 45000. Watch what happens:

ROW_NUMBER: assigns 3 and 4. No ties, just sequential numbering. Arbitrary which gets 3.

RANK: assigns 3 and 3, then skips to 5. The skip represents the gap where 4 would have been.

DENSE_RANK: assigns 3 and 3, then continues with 4. No gaps.

When to use which: ROW_NUMBER when you want exactly N rows with no duplicates (top 3 strictly). RANK for sports-style rankings where ties share a rank and the next rank skips. DENSE_RANK when you want tied entries to share a rank without creating gaps downstream.


Ranking Within Groups

The real power: rank within partitions.

result = q("""
    SELECT
        rep_name,
        region,
        amount,
        sale_date,
        RANK() OVER (
            PARTITION BY rep_name
            ORDER BY amount DESC
        ) AS rank_within_rep,
        RANK() OVER (
            PARTITION BY region
            ORDER BY amount DESC
        ) AS rank_within_region
    FROM sales
    ORDER BY rep_name, rank_within_rep;
""")
print(result)
Enter fullscreen mode Exit fullscreen mode

Output:

  rep_name region  amount   sale_date  rank_within_rep  rank_within_region
0     Alex  North   62000  2024-02-10                1                   1
1     Alex  North   45000  2024-01-05                2                   2
2     Alex  North   33000  2024-02-22                3                   3
3     Alex  North   12000  2024-01-12                4                   4
4     Alex  North    8500  2024-01-22                5                   5
5    Priya  South   75000  2024-01-18                1                   1
...
Enter fullscreen mode Exit fullscreen mode

Each rep ranked within their own results. Each rep also ranked within their region (which happens to be the same here since each rep owns one region). In a real dataset with multiple reps per region, these would differ.


LAG and LEAD: Looking at Adjacent Rows

LAG looks at the previous row. LEAD looks at the next row. Both without any joins.

result = q("""
    SELECT
        rep_name,
        sale_date,
        amount,
        LAG(amount)  OVER (PARTITION BY rep_name ORDER BY sale_date) AS prev_sale,
        LEAD(amount) OVER (PARTITION BY rep_name ORDER BY sale_date) AS next_sale,
        amount - LAG(amount) OVER (
            PARTITION BY rep_name ORDER BY sale_date
        ) AS change_from_prev
    FROM sales
    ORDER BY rep_name, sale_date;
""")
print(result)
Enter fullscreen mode Exit fullscreen mode

Output:

  rep_name   sale_date  amount  prev_sale  next_sale  change_from_prev
0     Alex  2024-01-05   45000        NaN    12000.0               NaN
1     Alex  2024-01-12   12000    45000.0     8500.0          -33000.0
2     Alex  2024-01-22    8500    12000.0    62000.0           -3500.0
3     Alex  2024-02-10   62000     8500.0    33000.0           53500.0
4     Alex  2024-02-22   33000    62000.0        NaN          -29000.0
Enter fullscreen mode Exit fullscreen mode

The first row has no previous sale so prev_sale is NULL. The last row has no next sale so next_sale is NULL. Every other row shows both neighbors.

change_from_prev shows whether each sale was up or down from the previous one. This is month-over-month or sale-by-sale comparison without any self-joins or CTEs.

LAG(amount, 2) looks two rows back. LEAD(amount, 1, 0) looks one row ahead and returns 0 instead of NULL when there is no next row. The default fill value is the third argument.


NTILE: Bucketing Into Groups

NTILE(n) divides rows into n equal buckets and assigns each row a bucket number.

result = q("""
    SELECT
        rep_name,
        amount,
        NTILE(4) OVER (ORDER BY amount DESC) AS quartile
    FROM sales
    ORDER BY amount DESC;
""")
print(result)
Enter fullscreen mode Exit fullscreen mode

Output:

  rep_name  amount  quartile
0    Priya   75000         1
1     Alex   62000         1
2     Alex   45000         1
3      Sam   45000         2
4    Priya   41000         2
5     Sam    28000         2
6    Priya   32000         3
...
Enter fullscreen mode Exit fullscreen mode

Top 25% of sales are in quartile 1. This is how you create performance tiers, risk buckets, or customer segments by value without hardcoding thresholds.


FIRST_VALUE and LAST_VALUE

Get the first or last value in a window.

result = q("""
    SELECT
        rep_name,
        sale_date,
        amount,
        FIRST_VALUE(amount) OVER (
            PARTITION BY rep_name
            ORDER BY sale_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS first_sale,
        LAST_VALUE(amount) OVER (
            PARTITION BY rep_name
            ORDER BY sale_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS latest_sale
    FROM sales
    ORDER BY rep_name, sale_date;
""")
print(result)
Enter fullscreen mode Exit fullscreen mode

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING means the window includes all rows in the partition from first to last. Without this, LAST_VALUE defaults to the current row which is never useful.


A Complete Real-World Example

Sales performance dashboard in one query.

result = q("""
    WITH ranked AS (
        SELECT
            rep_name,
            sale_date,
            amount,
            SUM(amount)     OVER (PARTITION BY rep_name ORDER BY sale_date)
                                                AS running_total,
            AVG(amount)     OVER (PARTITION BY rep_name)
                                                AS rep_avg,
            RANK()          OVER (ORDER BY amount DESC)
                                                AS overall_rank,
            RANK()          OVER (PARTITION BY rep_name ORDER BY amount DESC)
                                                AS personal_best_rank,
            LAG(amount)     OVER (PARTITION BY rep_name ORDER BY sale_date)
                                                AS prev_amount,
            amount - LAG(amount) OVER (
                PARTITION BY rep_name ORDER BY sale_date)
                                                AS mom_change
        FROM sales
    )
    SELECT
        rep_name,
        sale_date,
        amount,
        ROUND(running_total, 0)  AS cumulative,
        ROUND(rep_avg, 0)        AS rep_average,
        overall_rank,
        personal_best_rank,
        COALESCE(ROUND(mom_change, 0), 0) AS change_vs_prev
    FROM ranked
    ORDER BY rep_name, sale_date;
""")
print(result)
Enter fullscreen mode Exit fullscreen mode

Running total per rep. Company-wide rank per sale. Whether each sale beat the rep's previous sale. All in one query. No joins. No Python processing.


A Resource Worth Knowing

Use the Index, Luke at use-the-index-luke.com has a section on window functions that explains not just the syntax but why the database executes them the way it does. Written by Markus Winand, it is the most practically useful resource for understanding SQL performance and window functions together. Search "Use the Index Luke window functions."

Mode Analytics SQL tutorial also has an excellent interactive window functions section at mode.com/sql-tutorial/sql-window-functions that lets you run every query in this post directly in your browser. Zero setup. Immediately useful.


Try This

Create window_functions_practice.py.

Use the sales database from this post.

Write a query that shows each sale with the rep's running total, the rep's all-time best sale (use MAX as a window function), and how far this sale is from their personal best.

Find the top 2 sales per region using RANK and a CTE to filter. Only show rank 1 and rank 2 within each region.

Calculate month-over-month growth rate for the total company revenue. January is the baseline. Show February as a percentage change from January.

Using NTILE(3), bucket all sales into three tiers: top third, middle third, bottom third. Show how many sales are in each tier and the total revenue per tier.

For each rep, show their first sale date, latest sale date, first sale amount, and latest sale amount in a single query using FIRST_VALUE and LAST_VALUE.


What's Next

One post left in Phase 4: connecting SQL to Python so your data pipelines can use both languages seamlessly. Then Phase 5: Git, GitHub, Jupyter, Colab. The tools that organize your work and make it shareable. Then the real thing starts. Machine learning.

Top comments (0)