John Wakaba

Posted on Mar 4

# Understanding Joins and Window Functions in SQL

#sql #data #database #postgres

When working with relational databases, data is rarely stored in one single table. Instead, it is organized into multiple related tables to reduce redundancy and improve structure.

In a typical transactional database, you might have:

customers → customer information
orders → order transactions
books → product details

To analyze meaningful business insights, we must first combine this data. That’s where SQL Joins come in.

Once the data is combined, we can apply Window Functions to perform advanced analysis such as ranking, running totals, and trend comparisons.

This article walks you through both concepts in a logical flow — starting with joins and finishing with window functions.

SQL JOINS

Why Joins Matter

Relational databases are built on relationships.

For example:

A customer places many orders.
An order references one book.
A book can appear in many orders.

To analyze this properly, we must join the tables together using a shared key — usually customer_id or book_id.

INNER JOIN

An INNER JOIN returns only rows that exist in both tables.

If we want to see which customers placed orders:

SELECT o.order_id,
       c.first_name,
       c.second_name,
       o.order_date,
       o.quantity
FROM orders o
INNER JOIN customers c
    ON o.customer_id = c.customer_id;

✔ Returns only customers who have placed orders

❌ Excludes customers with no orders

This is the most commonly used join in data analysis.

LEFT JOIN (LEFT OUTER JOIN)

A LEFT JOIN returns:

All rows from the left table
Matching rows from the right table
NULL where no match exists

Example: Show all customers, even those who haven’t ordered anything.

SELECT c.customer_id,
       c.first_name,
       c.second_name,
       o.order_id,
       o.quantity
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id;

✔ Every customer appears

✔ Customers without orders will show NULL in order columns

This is extremely useful for identifying inactive customers.

RIGHT JOIN

A RIGHT JOIN does the opposite of a LEFT JOIN.

SELECT c.first_name,
       c.second_name,
       o.order_id
FROM customers c
RIGHT JOIN orders o
    ON c.customer_id = o.customer_id;

This ensures all orders appear, even if customer data is missing.

In practice, RIGHT JOIN is less common because we can usually rewrite it using LEFT JOIN by switching table order.

FULL JOIN (FULL OUTER JOIN)

A FULL JOIN returns all rows from both tables.

SELECT c.first_name,
       c.second_name,
       o.order_id
FROM customers c
FULL JOIN orders o
    ON c.customer_id = o.customer_id;

✔ Shows matches

✔ Shows unmatched customers

✔ Shows unmatched orders

This is useful for data auditing and reconciliation.

Joining Multiple Tables

Real-world analysis often requires more than two tables.

Example: See which customer ordered which book.

SELECT c.first_name,
       c.second_name,
       b.title,
       o.quantity,
       o.order_date
FROM orders o
JOIN customers c
    ON o.customer_id = c.customer_id
JOIN books b
    ON o.book_id = b.book_id;

Now we can see:

Customer name
Book title
Quantity ordered
Order date

This joined dataset becomes the foundation for deeper analysis.

CROSS JOIN

A CROSS JOIN produces all possible combinations between two tables.

SELECT c.first_name,
       b.title
FROM customers c
CROSS JOIN books b;

If you have:

10 customers
5 books

You get 50 rows.

This is useful when generating combinations for simulations or recommendation systems.

Anti-Join (Finding Missing Records)

SQL doesn’t have a direct ANTI JOIN, but we simulate it using LEFT JOIN + NULL filtering.

Example: Find customers who have never placed an order.

SELECT c.customer_id,
       c.first_name,
       c.second_name
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

This is powerful for churn analysis and business reporting.

How Joins Prepare Data for Window Functions

Notice something important:

Most advanced SQL analysis begins like this:

FROM orders o
JOIN customers c ON o.customer_id = c.customer_id

Why?

Because:

Joins combine related data
Window functions analyze that combined dataset

Joins prepare the structure.

Window functions perform the analytics.

Now that we understand joins, let’s move into advanced analytics.

WINDOW FUNCTIONS

They allow you to perform calculations across a set of table rows that are somehow related to the current row.

Window functions can be used to

Rank Rows.
Calculate cumulative totals.
Find the difference between consecutive rows in a dataset.

Window functions return a value for each row while still providing information from the related rows.

ROW_NUMBER ()

Assign a unique row number to each row in the result set.

In a real world scenario it can help us track which order each customer made first, second.......

Assigns a unique number to each row, starting from 1 based on the order specified by the ORDER BY clause.

The number will reset for each position if PARTITION BY is used.

Assign a unique row number to each order based on the order date and we want to reset numbering for each customer

SELECT o.order_id, c.first_name, c.second_name, o.order_date,
 ROW_NUMBER() OVER (PARTITION BY o.customer_id ORDER BY o.order_date) AS row_num
 FROM orders o
 JOIN customers c ON o.customer_id = c.customer_id;

ROW_NUMBER () : Assigns a unique number to each order.

PARTITION BY o.customer_id : Ensures that the row numbering starts fresh for each customer.

Query will list orders for each customer showing their row number (1,2,3---) in the sequence of orders.

Ranking the orders globally based on order date without resetting the numbering for each customer

What is needed just remove the PARTITION BY Clause

SQL Query Without Resetting Row Number:

SELECT o.order_id, c.first_name, c.second_name, o.order_date,
 ROW_NUMBER() OVER (ORDER BY o.order_date) AS row_num
 FROM orders o
 JOIN customers c ON o.customer_id = c.customer_id;

Without PARTITION BY the numbering is continous for all orders across customers based on their order_date.

RANK() AND DENSE_RANK()

RANK() assigns a rank to each row, with ties getting the same rank but leaving gaps in subsequent ranks.

DENSE_RANK() works similarly but without leaving gaps in the ranking.

RANK() SQL QUERY

Rank the customers based on the total quantity of books as they are ordered

SELECT c.first_name, c.second_name, SUM(o.quantity) AS total_quantity,
 RANK() OVER (ORDER BY SUM(o.quantity) DESC) AS rank
 FROM orders o
 JOIN customers c ON o.customer_id = c.customer_id
 GROUP BY c.first_name, c.second_name;

RANK() assigns a rank based on the SUM(o.quantity) in descending order.

If two customers have the same total quantity ordered they will receive the same rank and the next rank will have a gap. *Two customers rank 1 will result in the next customer being ranked 3rd.

USING DENSE_RANK()

Assigns a rank without gaps for ties.

Assigns a rank to each row but it does not leave gaps in the rankings if there are ties.

Calculate the dense rank of customers based on the total quantity of books they ordered

SELECT c.first_name, c.second_name, SUM(o.quantity) AS total_quantity,
 DENSE_RANK() OVER (ORDER BY SUM(o.quantity) DESC) AS dense_rank
 FROM orders o
 JOIN customers c ON o.customer_id = c.customer_id
 GROUP BY c.first_name, c.second_name
 ORDER BY dense_rank;

If two customers are tied they will receive the same rank but the next customer will receive the next consecutive rank.

1, 1, next will be 2nd

This key difference in RANK() and DENSE_RANK() is crucial for how you want to treat tied values in your analysis.

LEAD() AND LAG()

LEAD()

LEAD() access next row's value.

It is used to access a row that follows the current row at a specific physical offset.

Generally employed to compare the value of the current row with the value of the next row following the current row.

Compare quantity ordered by each customer in the current row with the quantity ordered in the next row

SELECT o.order_id, o.customer_id, o.quantity,
 LEAD(o.quantity) OVER (ORDER BY o.order_id) AS next_quantity
 FROM orders o;

LEAD(o.quantity) allows you to access the quantity of the next row for each customer.

Query gives the quantity ordered by the customer in the current row and the quantity ordered by the same customer in the next row.

For the last row for each customer the next quantity will be NULL because there is no next row.

LAG()

Access previous rows value

It is crucial for analyzing trends or behavior change over time.

Allows you to access data from a previous row within the same result set and is crucial for comparing values in the current row with the preceding row.

Operates on partitions created by the PARTITION BY clause.

Compare quantity ordered by each customer in the current row with the quantity ordered in the previous row using the LAG() function

SELECT o.order_id, o.customer_id, o.quantity,
 LAG(o.quantity) OVER (ORDER BY o.order_id) AS prev_quantity
 FROM orders o;

LAG(o.quantity) allows you to access the quantity ordered in the previous row for each customer

The query shows the quantity ordered in the current row and the quantity ordered in the previous row for the same customer.

First row previous quantity will be NULL as there is no previous row.

NTILE() FUNCTION

Partitions data into specified number of buckets.

Crucial for data analysis and reporting as it allows users to efficiently distribute rows and analyze data in a structured manner.

We want to divide customers into 2 groups (quartiles) based on their total order quantity.

SELECT c.first_name, c.second_name, SUM(o.quantity) AS total_quantity,
 NTILE(2) OVER (ORDER BY SUM(o.quantity) DESC) AS quantity_tile
 FROM orders o
 JOIN customers c ON o.customer_id = c.customer_id
 GROUP BY c.first_name, c.second_name
 ORDER BY quantity_tile;

NTILE(2) divides customers into two equal groups (quartiles) based on their total quantity ordered.

PARTITION BY

Divides result set into partitions to apply window functions independently.

This clause divides the result set into partitions and the window function works independently within each partition.

calculate the total quantity of orders for each customer and the average price of the books ordered by each customer.

SELECT c.first_name, c.second_name,
 SUM(o.quantity) AS total_quantity,
 AVG(b.price) AS avg_price,
 SUM(o.quantity) OVER (PARTITION BY o.customer_id) AS total_order_quantity
 FROM orders o
 JOIN customers c ON o.customer_id = c.customer_id
 JOIN books b ON o.book_id = b.book_id
 GROUP BY c.first_name, c.second_name, o.quantity, o.customer_id
 ORDER BY c.first_name;

SUM(o.quantity) gives total quantity ordered by each customer.

PARTITION BY o.customer_id ensures the total order quantity is calculated for each individual customer.

DEV Community