Dishon Gatambia (Dd)

Posted on Mar 8

SQL ANALYTICAL MECHANICS: JOINS & WINDOW FUNCTIONS

#database #sql #analytics #data

Structured Query Language (SQL) is the fundamental protocol for relational database management and data retrieval. Sophisticated analysis relies on two primary mechanisms: Joins and Window functions. Joins allow you to combine data from multiple tables, while window functions enable advanced calculations over subsets of data without collapsing rows. This technical overview details implementation strategies and optimisation for practitioners with foundational SQL knowledge.

Feature	Primary Mechanism	Result Set Impact
Joins	Horizontal merging of distinct tables via shared keys.	Alters row count and structure based on matching logic.
Window Functions	Analytical computations across defined row subsets (windows).	Preserves original row count; appends calculated data.

Core Principles

Joins: Essential for reconstructing normalized data structures. Operations include INNER, LEFT, RIGHT, and FULL OUTER joins to define the scope of the intersection or union.
Window Functions: Utilised for ranking (RANK()), running totals (SUM() OVER), and time-series analysis (LAG/LEAD). The OVER clause defines the logic through PARTITION BY and ORDER BY.

What a Join actually does

A JOIN is the mechanism by which a relational database combines rows from two or more tables based on a related column. The engine performs a logical comparison between rows of a left and right dataset, then determines which row combinations satisfy the specified condition.
The key insight: a JOIN does not "add columns to a table." It builds a new, temporary result set. Every row in that result is a pairing of rows from the participating tables. How many rows get included, and from which side, depends on the join type. Without joins, you'd be limited to single-table queries, which rarely suffice in real-world scenarios like e-commerce databases or customer relationship management systems.

Types of Joins

Inner Joins

Returns only the rows where the join condition is satisfied in both tables. Rows that have no match on either side are excluded entirely.

select e.name, d.department_name
from employees e 
inner join 
departments d on e.department_id = d.department_id;

What's happening - For every row in the employees table, the query looks for matching rows in the departments table. If no match exists, that employee does not appear in the output. If three departments exist for one employee, that employee appears three times.

Left Join (Left Outer Join)

Returns all rows from the left table, and the matched rows from the right table. Where no match exists on the right, NULL fills the right-side columns.

select e.name, d.department_name
from departments d 
left join employees e on d.department_id = e.department_id;

When to use it: Any time "missing" data is meaningful. If d.department_name is NULL, you know that employee has no assigned department. This is the join type to use for finding gaps.

--Find employees with no assigned departments
select e.name, d.department_name
from departments d 
left join employees e on d.department_id = e.department_id
where d.department_name = NULL;

This returns employees with no assigned departments.

Right Join

The mirror of LEFT JOIN. All rows from the right table are preserved; NULLs appear on the left where no match exists. In practice, most engineers rewrite RIGHT JOINs as LEFT JOINs by reversing table order because it's cleaner.

--Finding employees and all departments (including Empty departments)
select e.name, d.department_name
from employees e 
right join departments d on e.department_id = d.department_id;

Full Outer Join

Returns all rows from both tables. Where no match exists on either side, NULLs fill the unmatched columns. This union of both outer joins. In short, it shows all rows from all tables.

--Shows all rows from both tables
select e.name, d.department_name
from employees e 
full outer join departments d on e.department_id = d.department_id;

When to use it: Data reconciliation, audits, finding orphaned records on either side.

Cross Join

Produces the Cartesian product. Every row in the left table is paired with every row in the right table. No ON condition.

--Every employee with every project 
select e.name, p.project_name 
from employees e 
cross join projects p;

A table with 10 rows crossed with a table of 5 rows produces 50 rows. Use deliberately. Running a CROSS JOIN on two large tables without a WHERE clause can produce billions of rows.

Self Join

A table joined to itself. The table is aliased twice to distinguish which "instance" is being referenced.

--Employee who is a manager 
select distinct m.name as manager 
from employees e 
join employees m on e.manager_id = m.employee_id;

Use cases for Joins

Reporting: Combine customers' details from a customers table with their orders from an orders table to generate sales reports.
Data Integration: Merge data from disparate sources, such as user profiles and activity logs.
Data Cleaning: Identify and handle missing relationships, like orphaned records.

Best practices include using aliases for table names to improve readability, indexing join columns for performance, and avoiding unnecessary joins to prevent query slowdowns.

Window Functions

Window functions, introduced in SQL in 2003 and supported by most modern databases like PostgreSQL, MySQL (from version 8.0), and SQL Server, perform calculations across a set of rows related to the current row. Unlike aggregate functions (e.g., SUM(), AVG()) that group rows and reduce output, window functions maintain the original row count while adding computed columns.
With GROUP BY + SUM(), you get one row per group. With a window function SUM() OVER (PARTITION BY ...), every original row survives in the output, but each row gains a new column containing the aggregate value computed over its "window".
A window function is defined using the OVER() clause, which specifies the "window" or partition of rows to operate on. It can include:

PARTITION BY: Divides the result set into partitions (groups) where the function is applied independently.
ORDER BY: Sorts rows within each partition.
Frame Clause: Defines a subset of the partition (e.g., ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW for running totals).

Partition By

Divides the result set into groups (partitions). The window function restarts its calculation for each partition. This is analogous to GROUP BY, except the rows are not collapsed.

-- Total sales per region, shown on every row
select
    region,
    salesperson,
    sales_amount,
    sum(sales_amount) over (partition by region) as region_total
from sales;

Output will have one row per salesperson, but region_total shows the sum for all salespeople in that region.

ORDER BY Inside OVER

When ORDER BY is specified inside OVER(), the window function becomes aware of row sequence within each partition. For ranking functions, this determines rank. For running totals, it defines the cumulative direction.

-- Running total of sales per region, ordered by date
select
    region,
    sale_date,
    sales_amount,
    sum(sales_amount) over (
        partition by region
        order by sale_date
    ) as running_total
FROM sales;

Types of Window functions

Ranking functions

Assigns ranks or numbers to rows

Row_Number()

Assigns a unique sequential integer to each row within a partition. Ties receive different numbers (non-deterministic unless the ORDER BY is fully deterministic).

-- Rank employees by salary within each department
select
    department,
    employee_name,
    salary,
    row_number() over (
        partition by department
        order by salary desc
    ) as row_num
from employees;

Use case: Deduplication - select WHERE row_num = 1 to keep only the highest-paid employee per department.

Rank()

Like ROW_NUMBER(), but ties receive the same rank, and the next rank skips. A tie at rank 2 means the next rank is 4 (not 3).

select
    department,
    employee_name,
    salary,
    rank() over (
        partition by department
        order by salary DESC
    ) as salary_rank
from employees;

Dense_Rank()

Like RANK(), but no gaps after ties. A tie at rank 2 means the next rank is 3.

select
    department,
    employee_name,
    salary,
    dense_rank() over (
        partition by department
        order by salary desc
    ) as dense_rank
from employees;

Salary	ROW_NUMBER	RANK	DENSE_RANK
100	1	1	1
100	2	1	1
90	3	3	2

NTILE(n)

Divides the partition into n equal buckets and assigns a bucket number to each row.

-- Divide customers into quartiles by lifetime value
select
    customer_id,
    lifetime_value,
    ntile(4) over (order by lifetime_value desc) as value_quartile
from customers;

LAG() and LEAD()

Access values from a previous or subsequent row within the partition, without a self-join.

LAG(): Returns a value from a previous row.
LEAD(): Returns a value from a next row.

-- Month-over-month revenue change
select
    month,
    revenue,
    lag(revenue, 1) over (order by month) as prev_month_revenue,
    revenue - lag(revenue, 1) over (order month) as change
from monthly_revenue;

-- Show what the next month's revenue will be
select
    month,
    revenue,
    lead(revenue, 1) over (order by month) as next_month_revenue
from monthly_revenue;

FIRST_VALUE() and LAST_VALUE()

Return the first or last value in an ordered window frame.

-- Show each sale alongside the first and most recent sale in that region
Select
    region,
    sale_date,
    sales_amount,
    first_value(sales_amount) over (
        partition by region
        order by sale_date
    ) AS first_sale,
    last_value(sales_amount) over (
        partition by region
        order by sale_date
        rows BETWEEN unbounded preceding and unbounded following
    ) as last_sale
from sales;

Important: LAST_VALUE() requires an explicit frame clause (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING). Without it, the default frame ends at the current row; so LAST_VALUE() would just return the current row's value.

Window Frames: ROWS vs RANGE

The frame clause defines exactly which rows within the partition are included in the calculation relative to the current row.

-- Rolling 3-row average (current row + 2 preceding)
select
    sale_date,
    sales_amount,
    avg(sales_amount) over (
        order by sale_date
        rows between 2 preceding and current row
    ) as rolling_3_avg
from sales;

-- Rolling 7-day average (based on value range, not row count)
select
    sale_date,
    sales_amount,
    avg(sales_amount) over (
        order by sale_date
        range between interval '6 days' preceding and current row
    ) as rolling_7day_avg
from sales;

ROWS vs RANGE:

ROWS: physical row count
RANGE: value-based range (treats tied ORDER BY values as the same position)

Concept in plain terms

JOIN is the act of stitching tables together. A relational database stores data in separate tables deliberately; orders in one table, customers in another, to avoid redundancy. JOINs are how you reassemble that normalised data into the shape a query needs. The join type controls what happens when no match exists: INNER drops the row, OUTER preserves it with NULLs.
Window functions are the answer to "I need an aggregate, but I don't want to lose the detail rows." The OVER() clause tells the database: compute this value, but do it within a sliding context tied to each row. The result is a hybrid row-level detail that coexists with group-level calculations in the same output. This makes complex analytics expressible in a single query that would otherwise require multiple subqueries, self-joins, or application-level processing.
The two features address different problems. JOINs control the shape and completeness of the dataset. Window functions compute derived metrics across that dataset without changing its grain. Used together, they handle a wide range of analytical query patterns efficiently and without procedural code.

DEV Community