Mariam Turnesh

Posted on Mar 2

sql join and windows function

#beginners #database #sql #tutorial

🗃️ The Dataset We'll Use Throughout

Let's set up two tables and keep them consistent across every example. This way, when something changes between queries, you know exactly why.

CREATE TABLE employees (
    emp_id       INT PRIMARY KEY,
    full_name    VARCHAR(100),
    dept_id      INT,
    salary       DECIMAL(10,2),
    hire_date    DATE
);

INSERT INTO employees VALUES
(1, 'Alice Mensah',    1, 72000, '2020-03-15'),
(2, 'Brian Okonkwo',   2, 58000, '2021-07-01'),
(3, 'Carol Wanjiku',   1, 65000, '2019-11-20'),
(4, 'David Kimani',    NULL, 49000, '2023-01-10'),  -- no department assigned
(5, 'Evelyn Njoroge',  3, 91000, '2018-06-05');

CREATE TABLE departments (
    dept_id    INT PRIMARY KEY,
    dept_name  VARCHAR(100),
    location   VARCHAR(100)
);

INSERT INTO departments VALUES
(1, 'Engineering',    'Nairobi'),
(2, 'Marketing',      'Lagos'),
(3, 'Finance',        'Accra'),
(4, 'Human Resources', 'Nairobi');  -- no employees assigned yet

Two things to notice before we write a single join:

David (emp_id 4) has dept_id = NULL — he exists but belongs to no department.
Department 4 (Human Resources) has no employees assigned to it yet.

These two edge cases are what separate the join types from each other. Every join answers the same question differently: what do you do with rows that have no match?

Part 1 — SQL Joins

What Is a Join?

In relational databases, data is organized into multiple tables to keep it structured and efficient. SQL Joins allow you to combine related information from these tables into one unified result. That is what a join does — it combines rows from two tables based on a matching column (usually a primary key and a foreign key).

The ON clause is where you define the match condition. SQL evaluates it row by row and decides what to include in the result based on which join type you used.

-- General shape of a join
SELECT   columns
FROM     table_a
[JOIN TYPE] table_b ON table_a.column = table_b.column;

INNER JOIN — Only the Matches

An INNER JOIN returns rows only when a match exists in both tables. Rows that have no match on either side are dropped.

SELECT
    e.full_name,
    e.salary,
    d.dept_name,
    d.location
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

Result:

full_name	salary	dept_name	location
Alice Mensah	72,000.00	Engineering	Nairobi
Brian Okonkwo	58,000.00	Marketing	Lagos
Carol Wanjiku	65,000.00	Engineering	Nairobi
Evelyn Njoroge	91,000.00	Finance	Accra

What happened to David? His dept_id is NULL. There is no row in departments where dept_id = NULL, so he has no match — INNER JOIN drops him entirely.

What happened to Human Resources? No employee has dept_id = 4, so it also has no match — also dropped.

✅ Use INNER JOIN when you only want records with complete, matching data on both sides. It is the strictest join — nothing without a partner gets through.

LEFT JOIN — Keep Everything on the Left

A LEFT JOIN returns all rows from the left table, plus any matching rows from the right. Where there is no match, the right-side columns are filled with NULL.

SELECT
    e.full_name,
    e.salary,
    d.dept_name,
    d.location
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;

Result:

full_name	salary	dept_name	location
Alice Mensah	72,000.00	Engineering	Nairobi
Brian Okonkwo	58,000.00	Marketing	Lagos
Carol Wanjiku	65,000.00	Engineering	Nairobi
David Kimani	49,000.00	NULL	NULL
Evelyn Njoroge	91,000.00	Finance	Accra

David is back — with NULL where his department would be. Human Resources is still absent because it lives in the right table and LEFT JOIN does not protect it.

✅ Use LEFT JOIN when you need all records from one table regardless of whether they have a match. Classic use case: show all customers and their orders, including customers who haven't ordered yet.

Pro tip — the anti-join pattern:

A LEFT JOIN + WHERE IS NULL finds rows in the left table that have no match in the right table at all:

-- Find employees with no department assigned
SELECT e.full_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id
WHERE d.dept_id IS NULL;

-- Returns: David Kimani

RIGHT JOIN — Keep Everything on the Right

A RIGHT JOIN keeps all rows from the right table, and fills NULL for unmatched left-side columns.

SELECT
    e.full_name,
    d.dept_name,
    d.location
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.dept_id;

Result:

full_name	dept_name	location
Alice Mensah	Engineering	Nairobi
Carol Wanjiku	Engineering	Nairobi
Brian Okonkwo	Marketing	Lagos
Evelyn Njoroge	Finance	Accra
NULL	Human Resources	Nairobi

Human Resources now appears with NULL for full_name. David is gone — he is in the left table and has no right-table match.

💡 Most developers rewrite RIGHT JOIN as a LEFT JOIN by swapping the table order. The two queries below produce identical results:

-- As RIGHT JOIN:
SELECT e.full_name, d.dept_name
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.dept_id;

-- Equivalent as LEFT JOIN (usually preferred for readability):
SELECT e.full_name, d.dept_name
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id;

FULL OUTER JOIN — Keep Everything from Both Sides

A FULL OUTER JOIN returns every row from both tables, using NULL to fill gaps where a match does not exist.

SELECT
    e.full_name,
    d.dept_name,
    d.location
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.dept_id;

Result:

full_name	dept_name	location
Alice Mensah	Engineering	Nairobi
Brian Okonkwo	Marketing	Lagos
Carol Wanjiku	Engineering	Nairobi
David Kimani	NULL	NULL
Evelyn Njoroge	Finance	Accra
NULL	Human Resources	Nairobi

Both David and Human Resources appear — no row is lost from either table.

✅ Use FULL OUTER JOIN when reconciling two data sources, auditing for orphaned records, or merging datasets where you cannot afford to lose rows from either side.

⚠️ MySQL does not support FULL OUTER JOIN natively. Emulate it with a UNION:

SELECT e.full_name, d.dept_name
FROM employees e LEFT JOIN departments d ON e.dept_id = d.dept_id
UNION
SELECT e.full_name, d.dept_name
FROM employees e RIGHT JOIN departments d ON e.dept_id = d.dept_id;

SELF JOIN — A Table That References Itself

A SELF JOIN joins a table to itself. It is the natural solution for hierarchical data — like an employee table where each row stores a manager_id pointing to another row in the same table.

-- Each employee reports to a manager who is also in the employees table
ALTER TABLE employees ADD COLUMN manager_id INT;
UPDATE employees SET manager_id = 3 WHERE emp_id IN (1, 2); -- Alice & Brian → Carol
UPDATE employees SET manager_id = 5 WHERE emp_id IN (3, 4); -- Carol & David → Evelyn

-- Find each employee and their manager's name
SELECT
    e.full_name  AS employee,
    m.full_name  AS reports_to
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id;

Result:

employee	reports_to
Alice Mensah	Carol Wanjiku
Brian Okonkwo	Carol Wanjiku
Carol Wanjiku	Evelyn Njoroge
David Kimani	Evelyn Njoroge
Evelyn Njoroge	NULL

We use aliases (e for the employee copy, m for the manager copy) to tell SQL which version of the table we mean. We use LEFT JOIN so Evelyn — who has no manager — still appears.

CROSS JOIN — Every Possible Combination

A CROSS JOIN produces the Cartesian product of two tables. There is no ON clause because you are not matching rows — you are pairing every row from Table A with every row from Table B.

SELECT
    s.size_name,
    c.colour_name,
    s.size_name || '-' || c.colour_name AS sku
FROM sizes s         -- 3 rows: S, M, L
CROSS JOIN colours c; -- 4 rows: Black, White, Red, Navy
-- Result: 3 × 4 = 12 rows

⚠️ Be careful with scale. A CROSS JOIN on two tables with 10,000 rows each produces 100,000,000 rows. Always use it intentionally.

Multi-Table Joins

You can chain as many joins as you need in a single query. SQL builds an intermediate result after each one and feeds it into the next.

-- Full order breakdown: customer → order → product → category
SELECT
    c.full_name,
    o.order_id,
    o.order_date,
    p.product_name,
    cat.category_name,
    oi.quantity,
    oi.unit_price,
    (oi.quantity * oi.unit_price) AS line_total
FROM customers c
INNER JOIN orders      o   ON c.customer_id = o.customer_id
INNER JOIN order_items oi  ON o.order_id    = oi.order_id
INNER JOIN products    p   ON oi.product_id = p.product_id
INNER JOIN categories  cat ON p.category_id = cat.category_id
WHERE o.order_date >= '2024-01-01'
ORDER BY line_total DESC;

Join Type Reference

Join Type	Returns	Drops Unmatched Rows?	Best Use Case
`INNER JOIN`	Matching rows from both tables	Both sides dropped	Orders with a valid customer
`LEFT JOIN`	All left + matching right	Right side only	All customers, orders or not
`RIGHT JOIN`	All right + matching left	Left side only	All departments, staffed or not
`FULL OUTER JOIN`	Everything from both tables	Neither side dropped	Reconciling two data sources
`SELF JOIN`	Same table joined to itself	Depends on join type	Employee–manager hierarchy
`CROSS JOIN`	Every row × every row	N/A	Generate all product variants

Common Join Mistakes

1. Missing ON clause → Cartesian product

-- ❌ This silently returns employees × departments (20 rows of nonsense)
SELECT * FROM employees, departments;

-- ✅ Always specify the match condition
SELECT * FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

2. Ambiguous column names

-- ❌ SQL can't tell which table's dept_id you want
SELECT dept_id FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

-- ✅ Always qualify with a table alias
SELECT e.dept_id FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

3. Using INNER JOIN when you should use LEFT JOIN

If a stakeholder asks "why is David missing from this report?" — that is almost always a sign you used INNER JOIN where you needed LEFT JOIN. Any time unmatched rows should still appear in output, reach for LEFT JOIN.

Part 2 — Window Functions

The Problem They Exist to Solve

Here is a question that looks simple but trips up most beginners:

"Show me each employee's name, salary, and the average salary for their department — all on the same row."

Your first instinct might be GROUP BY:

-- ❌ GROUP BY collapses all employees into one row per department
SELECT dept_id, AVG(salary) AS dept_avg
FROM employees
GROUP BY dept_id;

-- Returns 3 rows (one per dept) — individual employees are gone

Window functions solve this. They perform calculations across a group of rows without collapsing those rows. Every employee row stays, and the calculation is added as an extra column.

-- ✅ Window function: every row stays, department average added
SELECT
    full_name,
    dept_id,
    salary,
    AVG(salary) OVER (PARTITION BY dept_id) AS dept_avg_salary
FROM employees;

full_name	dept_id	salary	dept_avg_salary
Alice Mensah	1	72,000.00	68,500.00
Carol Wanjiku	1	65,000.00	68,500.00
Brian Okonkwo	2	58,000.00	58,000.00
Evelyn Njoroge	3	91,000.00	91,000.00
David Kimani	NULL	49,000.00	49,000.00

Alice and Carol share a department average of 68,500 — the average of their two salaries. Every individual row is preserved.

Anatomy of the OVER() Clause

Every window function is paired with an OVER() clause that has three optional components:

function_name(column) OVER (
    PARTITION BY  grouping_column    -- divide rows into independent sub-groups
    ORDER BY      sort_column        -- define row order within each group
    ROWS BETWEEN  frame_start        -- define how far back/forward to look
                  AND frame_end
)

Component	What It Does	Required?
`PARTITION BY`	Divides rows into groups; function restarts per group	No
`ORDER BY`	Sorts rows within each partition	Only for rankings & running totals
Frame clause	Controls which rows are included in the calculation	No (has default)

ROW_NUMBER, RANK, and DENSE_RANK

These three functions assign a position number to each row within a partition. They differ only in how they handle ties.

SELECT
    full_name,
    dept_id,
    salary,
    ROW_NUMBER()  OVER (PARTITION BY dept_id ORDER BY salary DESC) AS row_num,
    RANK()        OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rnk,
    DENSE_RANK()  OVER (PARTITION BY dept_id ORDER BY salary DESC) AS dense_rnk
FROM employees;

To see the difference clearly, picture four employees with salaries: 80,000 / 72,000 / 72,000 / 58,000:

ROW_NUMBER:  1, 2, 3, 4   ← always unique, ties get different numbers
RANK:        1, 2, 2, 4   ← tied rows share a rank; next rank is skipped (no 3)
DENSE_RANK:  1, 2, 2, 3   ← tied rows share a rank; no numbers are skipped

Function	Tied Rows Get	Next Rank After Tie	Best Used For
`ROW_NUMBER()`	Different numbers (arbitrary)	Continues normally	Deduplication, pagination
`RANK()`	Same rank	Skips to account for ties	Sports leaderboards
`DENSE_RANK()`	Same rank	Continues without skipping	Clean rankings, interview questions

The Top-N per group pattern — by far the most common window function use case:

-- Top 2 earners per department
-- Window functions can't go in WHERE directly, so we wrap in a CTE
WITH ranked_employees AS (
    SELECT
        full_name,
        dept_id,
        salary,
        ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
    FROM employees
)
SELECT full_name, dept_id, salary
FROM ranked_employees
WHERE rn <= 2;

⚠️ Why can't window functions go in WHERE? SQL's execution order is: FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. Window functions run during SELECT — after WHERE has already filtered rows. To filter on a window function result, always wrap it in a CTE or subquery first.

LAG and LEAD — Looking Backwards and Forwards

LAG(column, n) returns the value from n rows before the current row.
LEAD(column, n) returns the value from n rows after the current row.
Both accept an optional third argument as a default when no prior/next row exists.

Before window functions, period-over-period comparisons required a self-join:

-- ❌ Old way: self-join to get last month's revenue
SELECT a.sale_month, a.revenue, b.revenue AS prev_month_revenue
FROM monthly_revenue a
LEFT JOIN monthly_revenue b
    ON b.sale_month = DATEADD(month, -1, a.sale_month);

-- ✅ New way: LAG does it in one line
SELECT sale_month, revenue,
    LAG(revenue, 1, 0) OVER (ORDER BY sale_month) AS prev_month_revenue
FROM monthly_revenue;

Full month-over-month analysis:

SELECT
    sale_month,
    revenue,
    LAG(revenue, 1)  OVER (ORDER BY sale_month) AS prev_month,
    LEAD(revenue, 1) OVER (ORDER BY sale_month) AS next_month,
    revenue - LAG(revenue, 1) OVER (ORDER BY sale_month) AS mom_change,
    ROUND(
        100.0 * (revenue - LAG(revenue, 1) OVER (ORDER BY sale_month))
        / NULLIF(LAG(revenue, 1) OVER (ORDER BY sale_month), 0),
        1
    ) AS mom_pct_change
FROM monthly_revenue
ORDER BY sale_month;

sale_month	revenue	prev_month	next_month	mom_change	mom_pct_change
2024-01	50,000	NULL	65,000	NULL	NULL
2024-02	65,000	50,000	58,000	+15,000	+30.0%
2024-03	58,000	65,000	71,000	-7,000	-10.8%
2024-04	71,000	58,000	NULL	+13,000	+22.4%

The NULLIF(..., 0) prevents division-by-zero when prev_month is 0. The first row returns NULL for LAG because there is no prior row — that is expected, not a bug.

Running Totals and Moving Averages — The Frame Clause

When you use ORDER BY inside a SUM() or AVG() window function, SQL needs to know which rows to include for each calculation. That is what the frame clause controls.

SELECT
    sale_date,
    daily_sales,

    -- Running total: from the very first row up to this row
    SUM(daily_sales) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,

    -- 7-day moving average: this row + the 6 rows before it
    ROUND(AVG(daily_sales) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 2) AS moving_avg_7day

FROM daily_sales
ORDER BY sale_date;

Frame clause keywords explained:

Keyword	Meaning
`UNBOUNDED PRECEDING`	The very first row in the partition
`n PRECEDING`	n rows before the current row
`CURRENT ROW`	The current row itself
`n FOLLOWING`	n rows after the current row
`UNBOUNDED FOLLOWING`	The very last row in the partition

⚠️ Default frame behaviour: When you write ORDER BY inside OVER() without a frame clause, SQL defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (a running total). This is fine for SUM, but it causes a silent bug with LAST_VALUE — see below.

FIRST_VALUE and LAST_VALUE

These functions return the first or last value in a window frame — useful for comparing each row against the boundary of its group.

SELECT
    full_name,
    dept_id,
    salary,
    FIRST_VALUE(full_name) OVER (
        PARTITION BY dept_id
        ORDER BY salary DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS top_earner_in_dept,

    LAST_VALUE(full_name) OVER (
        PARTITION BY dept_id
        ORDER BY salary DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS lowest_earner_in_dept

FROM employees;

⚠️ Always specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING when using LAST_VALUE. Without it, the default frame stops at the current row — meaning LAST_VALUE always returns the current row's own value. This is one of the most common silent bugs in window function code.

NTILE, PERCENT_RANK, and CUME_DIST

These three functions answer the question: where does this row sit relative to everyone else?

SELECT
    full_name,
    salary,
    NTILE(4)        OVER (ORDER BY salary DESC) AS quartile,
    ROUND(PERCENT_RANK() OVER (ORDER BY salary), 2) AS pct_rank,
    ROUND(CUME_DIST()    OVER (ORDER BY salary), 2) AS cum_dist
FROM employees
ORDER BY salary DESC;

full_name	salary	quartile	pct_rank	cum_dist
Evelyn Njoroge	91,000.00	1	1.00	1.00
Alice Mensah	72,000.00	1	0.75	0.80
Carol Wanjiku	65,000.00	2	0.50	0.60
Brian Okonkwo	58,000.00	3	0.25	0.40
David Kimani	49,000.00	4	0.00	0.20

NTILE(4) — divides rows into 4 equal buckets. Use for tiers and segments.
PERCENT_RANK — (rank - 1) / (total rows - 1). First row is always 0, last is always 1.
CUME_DIST — proportion of rows with a value ≤ the current row. Useful for "what percentage of employees earn at or below this salary?"

Named Windows — Reuse OVER() Clauses

If you use the same OVER() definition multiple times in one query, SQL lets you name it with WINDOW:

-- ❌ Repetitive — same OVER() clause written three times
SELECT
    full_name,
    AVG(salary)  OVER (PARTITION BY dept_id ORDER BY salary DESC) AS dept_avg,
    RANK()       OVER (PARTITION BY dept_id ORDER BY salary DESC) AS dept_rank,
    SUM(salary)  OVER (PARTITION BY dept_id ORDER BY salary DESC) AS dept_total
FROM employees;

-- ✅ Named window — define once, reuse everywhere
SELECT
    full_name,
    AVG(salary)  OVER dept_window AS dept_avg,
    RANK()       OVER dept_window AS dept_rank,
    SUM(salary)  OVER dept_window AS dept_total
FROM employees
WINDOW dept_window AS (PARTITION BY dept_id ORDER BY salary DESC);

Cleaner to read, and modern query engines can optimise it better.

Combining Window Functions with CTEs

The real power of window functions comes when you layer them inside CTEs to build multi-step analyses.

-- Find employees earning above their department average,
-- and rank them within their department
WITH dept_stats AS (
    SELECT
        emp_id,
        full_name,
        dept_id,
        salary,
        AVG(salary) OVER (PARTITION BY dept_id)                  AS dept_avg,
        RANK()      OVER (PARTITION BY dept_id ORDER BY salary DESC) AS dept_rank
    FROM employees
)
SELECT
    full_name,
    dept_id,
    salary,
    ROUND(dept_avg, 2)           AS dept_avg,
    ROUND(salary - dept_avg, 2)  AS above_avg_by,
    dept_rank
FROM dept_stats
WHERE salary > dept_avg
ORDER BY above_avg_by DESC;

Window Functions Reference

Function	Category	What It Does	Key Detail
`ROW_NUMBER()`	Ranking	Unique sequential number	No ties — always distinct
`RANK()`	Ranking	Rank with gaps after ties	1, 2, 2, 4
`DENSE_RANK()`	Ranking	Rank without gaps after ties	1, 2, 2, 3
`NTILE(n)`	Ranking	Divide into n equal buckets	Quartiles, deciles
`PERCENT_RANK()`	Ranking	Relative rank as 0–1 value	First row = 0, last = 1
`CUME_DIST()`	Ranking	Cumulative distribution	% of rows ≤ current value
`LAG(col, n)`	Navigation	Value n rows before current	Replaces self-join for time series
`LEAD(col, n)`	Navigation	Value n rows after current	Preview future period
`FIRST_VALUE(col)`	Navigation	First value in window frame	Great for comparing to group leader
`LAST_VALUE(col)`	Navigation	Last value in window frame	Always specify full frame!
`SUM(col)`	Aggregate	Running or partitioned total	With frame → running total
`AVG(col)`	Aggregate	Running or partitioned average	With `n PRECEDING` → moving avg
`COUNT(col)`	Aggregate	Running or partitioned count	Cumulative transaction count
`MIN(col)`	Aggregate	Minimum within window	Running minimum
`MAX(col)`	Aggregate	Maximum within window	Running maximum

Performance Tips

For Joins:

Index your join columns. Unindexed ON clause columns force full table scans. Always index foreign keys.
Filter before joining. Use a CTE or subquery to reduce rows before the join, not after.
Select only what you need. SELECT * across multiple joined tables moves far more data than necessary.
Use EXPLAIN/EXPLAIN ANALYZE. This shows the execution plan and tells you whether SQL is using your indexes.

For Window Functions:

Window functions execute after WHERE — filter first to reduce the data the window function sees.
Use named WINDOW clauses to avoid repeating identical OVER() definitions.
On very large tables, partitioning your source data before applying window functions (via a CTE) can significantly reduce compute time.

⚡ Complete Cheat Sheet

-- ─── JOINS ───────────────────────────────────────────────────────────────────

-- INNER: only rows matching in both tables
SELECT e.full_name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

-- LEFT: all employees, department info where it exists
SELECT e.full_name, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;

-- Anti-join: employees with NO department
SELECT e.full_name FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id
WHERE d.dept_id IS NULL;

-- FULL OUTER: everything from both tables
SELECT e.full_name, d.dept_name
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.dept_id;

-- SELF: employee + their manager
SELECT e.full_name AS employee, m.full_name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id;

-- CROSS: all size × colour combos
SELECT s.size_name, c.colour_name
FROM sizes s CROSS JOIN colours c;

-- ─── WINDOW FUNCTIONS ────────────────────────────────────────────────────────

-- Dept average without collapsing rows
SELECT full_name, salary,
    AVG(salary) OVER (PARTITION BY dept_id) AS dept_avg
FROM employees;

-- Top 2 earners per department
WITH ranked AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
    FROM employees
)
SELECT * FROM ranked WHERE rn <= 2;

-- RANK vs DENSE_RANK
SELECT full_name, salary,
    RANK()        OVER (ORDER BY salary DESC) AS rnk,
    DENSE_RANK()  OVER (ORDER BY salary DESC) AS dense_rnk
FROM employees;

-- Month-over-month change
SELECT sale_month, revenue,
    LAG(revenue, 1)  OVER (ORDER BY sale_month) AS prev_month,
    revenue - LAG(revenue, 1) OVER (ORDER BY sale_month) AS change
FROM monthly_revenue;

-- Running total
SELECT sale_date, daily_sales,
    SUM(daily_sales) OVER (ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM daily_sales;

-- 7-day moving average
SELECT sale_date, daily_sales,
    ROUND(AVG(daily_sales) OVER (ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), 2) AS moving_avg_7d
FROM daily_sales;

-- NTILE: quartile segmentation
SELECT customer_name, lifetime_spend,
    NTILE(4) OVER (ORDER BY lifetime_spend DESC) AS quartile
FROM customers;

-- Named window (reuse OVER clause)
SELECT full_name, salary,
    AVG(salary) OVER w AS dept_avg,
    RANK()      OVER w AS dept_rank
FROM employees
WINDOW w AS (PARTITION BY dept_id ORDER BY salary DESC);

Key Takeaways

On Joins:

Every join type answers the same question differently: what happens to rows that have no match?
INNER JOIN drops unmatched rows from both sides.
LEFT JOIN protects the left table — it is the most used join in day-to-day analytics.
FULL OUTER JOIN protects both sides — use it when you cannot afford to lose rows from either table.
Always index join columns. Always alias your tables. Always qualify ambiguous column names.

On Window Functions:

GROUP BY collapses rows. Window functions add information to rows without removing them — that distinction is everything.
PARTITION BY divides into groups but keeps all rows. It is not GROUP BY.
ORDER BY inside OVER() is independent of the query-level ORDER BY.
Window functions run during SELECT — after WHERE — so you must wrap them in a CTE or subquery to filter on their results.
Always specify a full frame clause when using LAST_VALUE.

Anything unclear or want a deeper dive into a specific function? Drop it in the comments.

DEV Community

sql join and windows function

🗃️ The Dataset We'll Use Throughout

Part 1 — SQL Joins

What Is a Join?

INNER JOIN — Only the Matches

LEFT JOIN — Keep Everything on the Left

RIGHT JOIN — Keep Everything on the Right

FULL OUTER JOIN — Keep Everything from Both Sides

SELF JOIN — A Table That References Itself

CROSS JOIN — Every Possible Combination

Multi-Table Joins

Join Type Reference

Common Join Mistakes

Part 2 — Window Functions

The Problem They Exist to Solve

Anatomy of the OVER() Clause

ROW_NUMBER, RANK, and DENSE_RANK

LAG and LEAD — Looking Backwards and Forwards

Running Totals and Moving Averages — The Frame Clause

FIRST_VALUE and LAST_VALUE

NTILE, PERCENT_RANK, and CUME_DIST

Named Windows — Reuse OVER() Clauses

Combining Window Functions with CTEs

Window Functions Reference

Performance Tips

⚡ Complete Cheat Sheet

Key Takeaways

Top comments (0)