GRACE MUTHONI MWANGI

Posted on Mar 2 • Edited on Mar 7

Mastering joins and window functions on PostgreSQL

#database #postgres #sql #tutorial

JOINS

Joins are SQL clauses used to combine rows from two or more tables (or views) based on a related column.

There are numerous join types in PostgreSQL. The way to determine the best join to select mainly depends on the following:

The output that you would like displayed from the tables that are joined, including the unmatched records.
The relationship between the tables and/or views, including the common columns.

Data tables used for illustration purposes are:

Types of joins on PostgreSQL:

Cross join - This join combines every row from one table with every row of the other table.

For example;

select p.project_name, e.salary, e.name, e.employee_id from sales_data.projects p cross join sales_data.employees e ;

This tends to create a multiplication of the results based on the row counts of the 2 tables.
The 2 tables each have 5 rows. The end result generated by the code has 25 rows.
This join type doesn't require the on statement that matches the common column between the 2 tables; many term it as a non-matching join.

Inner join - This join typically returns the matching values from both tables. This simply means that if there is no match in either table, then the row is excluded from the results.

For example;
select emp.name, dep.department_id, dep.department_name from sales_data.employees emp inner join sales_data.departments dep on emp.department_id = dep.department_id;

The employee called Eve is not yet assigned to any department; therefore, her name is excluded from the results.
Additionally, the Finance department doesn't have an employee linked to it and is therefore also not displayed in the query's result.

Left outer join - This join, also referred to as a left join, typically returns all the rows from the left table and the matching values from the right table (mentioned 2nd). Rows from this table appear based on the results of the table mentioned 1st, which is on the left. For example:

select * from sales_data.projects p left outer join sales_data.employees e on p.employee_id = e.employee_id;

select * from sales_data.projects p left join sales_data.employees e on p.employee_id = e.employee_id;

The left outer join from this code returns all the rows from the project table, including the projects that aren't assigned to any particular employee.
Additionally, not all employees appear in the query result because some employees don't have projects assigned to them yet.

Right outer join - This join, also referred to as a 'right join', typically returns all rows on the right (table mentioned last) and the matching values from the left table. For example:

select * from sales_data.projects p right outer join sales_data.employees e on p.employee_id = e.employee_id;

select * from sales_data.projects p right join sales_data.employees e on p.employee_id = e.employee_id;

The right outer join from this code returns all the rows from the Employees table positioned on the right.
This simply means that all employees are listed and the projects they are assigned to; however, not all projects have been captured, as they aren't assigned to any employee yet.

Full outer join - This join, also referred to as a full join, typically returns all the rows from the tables in the query. The values of the first table mentioned typically appear on the left. For example:

select * from sales_data.projects p full outer join sales_data.employees e on p.employee_id = e.employee_id;

select * from sales_data.projects p full join sales_data.employees e on p.employee_id = e.employee_id;

This code returns all the rows from the project table first, and then extra rows containing a record of the employees who aren't yet assigned to projects yet.

Natural join - This join, only matching values from the tables referenced. It automaticcally ensures tables are joined on the basis of the primary key and the foreign key of the 2nd table. The only limitation is that these columns have to share a name. For example:

select * from sales_data.employees e natural join sales_data.projects p;

Enriching join queries

You can additionally enrich the joins by adding different functions. For example;

WHERE clause:

Using the WHERE clause to filter values from the tables as shown below:

select * from sales_data.projects p full join sales_data.employees e on p.employee_id = e.employee_id where e.employee_id <4;

In the query above, the where clause filters by the employee_id column, only displaying rows containing employee_ids that are lower than 4.
It is importan to note that the where clause only filters rows; it cannot work on an aggregate column.
If the column is an aggregrate then it is best to use "Having" clause e.g

select c.first_name, c.last_name, SUM(s.quantity_sold) as units_sold from assignment.customers c inner join assignment.sales s on c.customer_id = s.customer_id group by s.customer_id,c.first_name, c.last_name having SUM(s.quantity_sold) > 5;

ORDER BY clause

This clause helps to sort the data within the query, therefore returning results that are in ascending/descending order.
This clause is best when returning numeric values, and you can order the values to see the highest or the lowest.
Additionally, you can order text values to have the values ordered alphabetically, like in the example below:

select * from sales_data.projects p full join sales_data.employees e on p.employee_id = e.employee_id order by e.employee_id asc nulls last, p.project_name asc;

Window Functions

These are functions that perform a set of calculations across a set of rows related to the current row in the window (query result displayed). These window functions are further divided by their data-related functionality. Typically, these functions introduce a new column to the displayed query results.

The window functions are differentiated from other normal functions using the 'over()' clause that instructs PostgreSQL to not collapse rows like 'GROUP BY' would, instead, calculate this function by row.

Ranking Functions on PostgreSQL:

These functions help to assign ranks or row numbers to the displayed results within a partition.

Examples of Ranking Functions:

Row_number – this function gives a unique number to each row. The numbering system is sequential and has no gaps or duplicates. select *, row_number() over() as id_column from sales_data.working_hub;
_Rank _– this function assigns numbers, but with gaps after there is a tie in values referenced. Typically, this will have duplicates, in the numbers displayed for the duplicate values (the ties) select *, rank () over() as id_column from sales_data.working_hub; The code above typically returns all the rows numbered as 1 because the order by is missing(undefined).

select *, rank () over(order by salary desc nulls first) as id_column from sales_data.working_hub;

This code, on the other hand, clearly depicts the working of the rank function. The salary column is used to number the query results.
We have 2 NULL values at the top because of the code: nulls first. These are both numbered as 1, and then the next value is numbered as 3.
The highest value of the column_id that is generated is 7

Dense rank - works like rank but without gaps. Meaning the same value will still be displayed where there is a tie, but the next value will be sequential after the previously assigned number.

select *, dense_rank() over (order by salary desc nulls first) as id_colum from sales_data.working_hub;

The highest value of the column id generated is 6. We have 2 NULL values at the top (nulls first) these are both numbered as 1, and then the next value is numbered as 2.

-_ Ntile_ - works as a quartile function would on MS Excel, equally dividing a certain class in the specified number of times

Aggregate Functions on PostgreSQL:

These are the normal aggregate functions with the 'over' clause, which helps run the functions without grouping the rows.

select *, SUM (e.salary) over () as total_salary from sales_data.projects p full join sales_data.employees e on p.employee_id = e.employee_id order by e.employee_id asc nulls last, p.project_name asc;

There is a new column that is created from the as clause called total_salary.
The contents of the additional column repeat the total salary amount on every row, returning 242k across all rows in the result.
Needless to say, other aggregate functions can also be used, as in the example given above.

Value Navigation Functions on PostgreSQL:

These functions help to access other rows within the result in the window, therefore helping in comparisons based on the data returned in the column.

Lag function - This returns data from the previous cell by the specified column to the current row. For example:

select *, lag(salary) over (order by salary desc nulls first) as previous_data from sales_data.working_hub;

Lead function - This returns data from the cell below (next cell), acting as a relative cell reference. For example:

select *, lead(salary) over (order by salary desc nulls first) as pervious_data from sales_data.working_hub;

First Value - This returns data from the first value displayed in the window, while Last Value returns data from the last cell displayed in the window.

Enriching window functions

Partition by divides the result set into logical groups based on the specified column while preserving individual rows. Unlike a GROUP BY clause, it does not collapse rows but instead allows the window function to perform calculations within each partition. For example:

select *, SUM (e.salary) over (partition by project_name) as total_salary from sales_data.projects p full join sales_data.employees e on p.employee_id = e.employee_id order by e.employee_id asc nulls last, p.project_name asc;

The contents are returned as would a SUMIF function if it were performed on Excel. So, the sum range is the salary column, the criteria range is the project_name column, and the criteria is the exact department displayed on that specific row.