Mariam Turnesh

Posted on Mar 2

SQL Joins and Window Functions Explained

#sql #beginners #database #dataanalysis

The Tables We'll Use

We'll keep the same two tables throughout so examples stay consistent and easy to follow.

CREATE TABLE employees (
    emp_id    INT PRIMARY KEY,
    name      VARCHAR(100),
    dept_id   INT,
    salary    DECIMAL(10,2)
);

INSERT INTO employees VALUES
(1, 'Alice',  1, 72000),
(2, 'Brian',  2, 58000),
(3, 'Carol',  1, 65000),
(4, 'Eve',    3, 91000);

CREATE TABLE departments (
    dept_id    INT PRIMARY KEY,
    dept_name  VARCHAR(100),
    location   VARCHAR(100)
);

INSERT INTO departments VALUES
(1, 'Engineering', 'Nairobi'),
(2, 'Marketing',   'Lagos'),
(3, 'Finance',     'Accra');

Part 1 — SQL Joins

What is a Join?

In relational databases, data is stored in separate tables to keep it organized, reduce duplication, and improve efficiency. For example: employee details might be stored in one table, while department information is stored in another.

When you need information from both tables at the same time, you use a JOIN. A JOIN combines rows from two tables using a related column, typically a primary key in one table and a foreign key in another. This relationship defines how the records are connected.

The ON clause specifies the exact condition used to match rows between the tables. SQL evaluates this condition to determine which records relate to each other. The result is a single, combined dataset that brings together information that was originally stored separately.

For instance, if both employees and departments share a column like dept_id, that column acts as the bridge between them. The JOIN uses it to match each employee to the correct department.

FROM table_a
JOIN table_b ON table_a.shared_column = table_b.shared_column

A Note on Aliases

Before we get into join types, let's talk about aliases — the shorthand names you see on tables (e, d) in most join queries.

-- Without aliases works,but it gets unnecessarily wordy
SELECT employees.name, departments.dept_name
FROM employees
INNER JOIN departments ON employees.dept_id = departments.dept_id;

-- With aliases it is same result, much cleaner
SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

You assign an alias by putting a short name right after the table name. From that point on, e means employees and d means departments. Aliases are especially important when:

Your table names are long
You're joining multiple tables and need to be clear about which column belongs to which table
You're writing a self join, where the same table appears twice and needs two different names

You'll see aliases in every example from here on.

INNER JOIN

An INNER JOIN returns only the rows where a match exists in both tables. If a row in one table has no matching row in the other, it is left out entirely.

SELECT e.name, e.salary, d.dept_name, d.location
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

name	salary	dept_name	location
Alice	72,000.00	Engineering	Nairobi
Brian	58,000.00	Marketing	Lagos
Carol	65,000.00	Engineering	Nairobi
Eve	91,000.00	Finance	Accra

Every employee has a matching department, so every employee appears. If any employee had a dept_id that didn't exist in departments, they would be excluded.

✅ Use INNER JOIN when you only want rows with complete matching data on both sides. It's the strictest join, nothing without a match gets through.

LEFT JOIN

A LEFT JOIN returns all rows from the left table, and brings in matching data from the right table where it exists. Where there's no match, the right-side columns come back as NULL.

The "left" table is simply whichever table you write after FROM.

SELECT e.name, e.salary, d.dept_name, d.location
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;

name	salary	dept_name	location
Alice	72,000.00	Engineering	Nairobi
Brian	58,000.00	Marketing	Lagos
Carol	65,000.00	Engineering	Nairobi
Eve	91,000.00	Finance	Accra

With our clean dataset this looks identical to the INNER JOIN result , that's fine. The real difference shows up the moment an employee has no matching department. They'd still appear, with NULL in the department columns.

✅ Use LEFT JOIN when you need every record from the left table regardless of whether it has a match. This is the most commonly used join in analytics. Classic example: show all customers, including those who haven't placed an order yet.

RIGHT JOIN

A RIGHT JOIN is the mirror of a LEFT JOIN. It keeps all rows from the right table and fills NULL for unmatched left-side columns.

SELECT e.name, d.dept_name, d.location
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.dept_id;

name	dept_name	location
Alice	Engineering	Nairobi
Carol	Engineering	Nairobi
Brian	Marketing	Lagos
Eve	Finance	Accra

Here all departments are protected. If a department had no employees, it would still appear with NULL for the employee name.

💡 In practice, most developers rewrite RIGHT JOIN as a LEFT JOIN by swapping the table order — it reads more naturally. These two queries return the same result:

-- As RIGHT JOIN
SELECT e.name, d.dept_name
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.dept_id;

-- Same result as LEFT JOIN (swap the tables)
SELECT e.name, d.dept_name
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id;

FULL OUTER JOIN

A FULL OUTER JOIN returns every row from both tables. Where there's no match on either side, NULL fills in the gap. Nothing is left out.

SELECT e.name, d.dept_name, d.location
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.dept_id;

name	dept_name	location
Alice	Engineering	Nairobi
Brian	Marketing	Lagos
Carol	Engineering	Nairobi
Eve	Finance	Accra

With our dataset every row matches, so the result looks complete. But when working with real-world data — where mismatches and missing records are common — FULL OUTER JOIN ensures nothing from either side gets silently dropped.

✅ Use FULL OUTER JOIN when reconciling two data sources or auditing records where you cannot afford to lose rows from either side.

⚠️ MySQL doesn't support FULL OUTER JOIN natively. Use a UNION instead:

SELECT e.name, d.dept_name
FROM employees e LEFT JOIN departments d ON e.dept_id = d.dept_id
UNION
SELECT e.name, d.dept_name
FROM employees e RIGHT JOIN departments d ON e.dept_id = d.dept_id;

SELF JOIN

A SELF JOIN joins a table to itself. It sounds strange, but it's the natural tool for hierarchical data — the classic example being an employee table where each row has a manager_id that points to another employee in the same table.

ALTER TABLE employees ADD COLUMN manager_id INT;
UPDATE employees SET manager_id = 4 WHERE emp_id IN (1, 2, 3); -- Alice, Brian, Carol report to Eve

-- Join the table to itself to get manager names
SELECT
    e.name          AS employee,
    m.name          AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id;

employee	manager
Alice	Eve
Brian	Eve
Carol	Eve
Eve	NULL

Notice the aliases here — e for the employee and m for the manager. Without aliases, SQL wouldn't know which copy of the table you mean. Eve has no manager, so she returns NULL — which is why we use LEFT JOIN instead of INNER JOIN.

✅ Use SELF JOIN when your data has a relationship with itself — org charts, categories with subcategories, or any parent-child structure stored in one table.

CROSS JOIN

A CROSS JOIN produces the Cartesian product of two tables — every row from Table A is paired with every row from Table B. There's no ON clause because you're not matching rows, you're combining all of them.

SELECT s.size_name, c.colour_name
FROM sizes s
CROSS JOIN colours c;

If sizes has 3 rows (S, M, L) and colours has 4 rows (Black, White, Red, Navy):

size_name	colour_name
S	Black
S	White
S	Red
S	Navy
M	Black
M	White
...	...

3 × 4 = 12 rows total.

⚠️ On large tables this explodes fast. Two tables with 1,000 rows each produce 1,000,000 rows. Always use CROSS JOIN intentionally.

Join Types at a Glance

Join	Returns	Drops Unmatched?
`INNER JOIN`	Matching rows from both tables	Yes — both sides
`LEFT JOIN`	All left rows + matching right	Right side only
`RIGHT JOIN`	All right rows + matching left	Left side only
`FULL OUTER JOIN`	All rows from both tables	Neither side
`SELF JOIN`	Table joined to itself	Depends on join type used
`CROSS JOIN`	Every possible row combination	N/A

Part 2 — Window Functions

What is a Window Function?

A window function performs a calculation across a set of rows without collapsing them into one. That's the key difference from GROUP BY.

Say you want each employee's salary alongside the average salary for their department:

-- GROUP BY collapses individual rows — you lose the employees
SELECT dept_id, AVG(salary) AS dept_avg
FROM employees
GROUP BY dept_id;

dept_id	dept_avg
1	68,500.00
2	58,000.00
3	91,000.00

The individual employees are gone. You got the group summary but lost the detail.

With a window function:

SELECT
    e.name,
    e.salary,
    d.dept_name,
    AVG(e.salary) OVER (PARTITION BY e.dept_id) AS dept_avg
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;

name	salary	dept_name	dept_avg
Alice	72,000.00	Engineering	68,500.00
Carol	65,000.00	Engineering	68,500.00
Brian	58,000.00	Marketing	58,000.00
Eve	91,000.00	Finance	91,000.00

Every employee row stays. The department average is added as an extra column. Alice and Carol share 68,500 — the average of their two salaries.

The rule: GROUP BY collapses rows into summaries. Window functions add a calculation to each row without removing any.

The OVER() Clause

Every window function uses an OVER() clause. This is what defines the "window" — the set of rows the function looks at for each calculation.

function_name() OVER (
    PARTITION BY  column   -- split rows into groups
    ORDER BY      column   -- sort rows within each group
)

PARTITION BY — divides the data into groups. The function restarts its calculation for each group. Without it, all rows are treated as one single window.
ORDER BY inside OVER() — sorts rows within each partition. Required for ranking and running totals. This is separate from the ORDER BY at the end of your query.

ROW_NUMBER, RANK, and DENSE_RANK

These three functions assign a position number to each row within a partition. The difference comes down to how they handle ties.

SELECT
    e.name,
    d.dept_name,
    e.salary,
    ROW_NUMBER() OVER (PARTITION BY e.dept_id ORDER BY e.salary DESC) AS row_num,
    RANK()       OVER (PARTITION BY e.dept_id ORDER BY e.salary DESC) AS rnk,
    DENSE_RANK() OVER (PARTITION BY e.dept_id ORDER BY e.salary DESC) AS dense_rnk
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;

name	dept_name	salary	row_num	rnk	dense_rnk
Alice	Engineering	72,000.00	1	1	1
Carol	Engineering	65,000.00	2	2	2
Brian	Marketing	58,000.00	1	1	1
Eve	Finance	91,000.00	1	1	1

The numbering resets for each department because of PARTITION BY dept_id. To see the difference between RANK and DENSE_RANK, you need a tie. Imagine two employees both earning 72,000:

salary	ROW_NUMBER	RANK	DENSE_RANK
72,000.00	1	1	1
72,000.00	2	1	1
65,000.00	3	3	2
58,000.00	4	4	3

ROW_NUMBER — always unique. Ties get different numbers.
RANK — tied rows share a rank. The next rank skips to account for the tie (1, 1, 3).
DENSE_RANK — tied rows share a rank. No numbers are skipped (1, 1, 2).

Getting the top earner per department:

-- Window functions can't go in WHERE directly.
-- Wrap them in a CTE first, then filter.
WITH ranked AS (
    SELECT
        e.name,
        d.dept_name,
        e.salary,
        ROW_NUMBER() OVER (PARTITION BY e.dept_id ORDER BY e.salary DESC) AS rn
    FROM employees e
    JOIN departments d ON e.dept_id = d.dept_id
)
SELECT name, dept_name, salary
FROM ranked
WHERE rn = 1;

name	dept_name	salary
Alice	Engineering	72,000.00
Brian	Marketing	58,000.00
Eve	Finance	91,000.00

LAG and LEAD

LAG looks back at a previous row's value. LEAD looks ahead at a future row's value. They're the standard tool for period-over-period comparisons.

SELECT
    sale_month,
    revenue,
    LAG(revenue, 1)  OVER (ORDER BY sale_month) AS prev_month,
    LEAD(revenue, 1) OVER (ORDER BY sale_month) AS next_month,
    revenue - LAG(revenue, 1) OVER (ORDER BY sale_month) AS change
FROM monthly_revenue
ORDER BY sale_month;

sale_month	revenue	prev_month	next_month	change
2024-01	50,000	NULL	65,000	NULL
2024-02	65,000	50,000	58,000	+15,000
2024-03	58,000	65,000	71,000	-7,000
2024-04	71,000	58,000	NULL	+13,000

LAG(revenue, 1) means: look back 1 row and return its revenue value. The first row returns NULL for LAG because there's no row before it — that's expected, not an error.

✅ Use LAG/LEAD when comparing values across time periods — monthly revenue, daily active users, week-over-week sales.

Running Totals with SUM() OVER

Using SUM() as a window function with ORDER BY gives you a running total — each row accumulates the sum of all previous rows.

SELECT
    sale_date,
    daily_sales,
    SUM(daily_sales) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM daily_sales
ORDER BY sale_date;

sale_date	daily_sales	running_total
2024-01-01	1,000	1,000
2024-01-02	1,500	2,500
2024-01-03	800	3,300
2024-01-04	2,200	5,500

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW tells SQL: include every row from the very beginning of the partition up to the current row.

NTILE — Dividing Rows into Buckets

NTILE(n) divides your rows into n equal groups and assigns each row a bucket number. It's the go-to function for creating tiers, quartiles, and percentile segments.

SELECT
    e.name,
    e.salary,
    NTILE(4) OVER (ORDER BY e.salary DESC) AS quartile,
    CASE NTILE(4) OVER (ORDER BY e.salary DESC)
        WHEN 1 THEN 'Top 25%'
        WHEN 2 THEN 'Upper Mid'
        WHEN 3 THEN 'Lower Mid'
        WHEN 4 THEN 'Bottom 25%'
    END AS salary_band
FROM employees e
ORDER BY e.salary DESC;

name	salary	quartile	salary_band
Eve	91,000.00	1	Top 25%
Alice	72,000.00	2	Upper Mid
Carol	65,000.00	3	Lower Mid
Brian	58,000.00	4	Bottom 25%

Window Functions at a Glance

Function	What It Does
`ROW_NUMBER()`	Unique sequential number per row — no ties
`RANK()`	Rank with gaps after ties
`DENSE_RANK()`	Rank without gaps after ties
`NTILE(n)`	Divide rows into n equal buckets
`LAG(col, n)`	Return value from n rows before current
`LEAD(col, n)`	Return value from n rows after current
`SUM() OVER()`	Running or partitioned total
`AVG() OVER()`	Running or partitioned average
`COUNT() OVER()`	Running or partitioned count
`MIN() OVER()`	Minimum within the window
`MAX() OVER()`	Maximum within the window

Key Takeaways

Joins:

Joins combine data from multiple tables using a shared column defined in the ON clause.
INNER JOIN — only rows that match on both sides.
LEFT JOIN — all rows from the left, NULL where there's no match on the right.
FULL OUTER JOIN — all rows from both sides, no exceptions.
SELF JOIN — a table joined to itself, great for hierarchical data.
Always use aliases to keep join queries clean and unambiguous.

Window Functions:

GROUP BY collapses rows into summaries. Window functions add calculations to each row without removing any.
PARTITION BY creates groups but keeps all individual rows intact.
ORDER BY inside OVER() is separate from your query-level ORDER BY.
You can't filter window functions in WHERE — wrap them in a CTE first.
ROW_NUMBER for unique ranks, RANK for gaps on ties, DENSE_RANK for no gaps.

If a specific part didn't click, drop it in the comments — happy to break it down further.

DEV Community