DEV Community

Adam Furmanek for Metis

Posted on • Originally published at metisdata.io

Understanding join strategies in PostgreSQL

One of the good practices for database design is normalization. We decompose data into multiple tables to avoid duplication and make data storage more organized. As a result, we need to join tables when querying the data. The SQL engine needs to calculate the result of the join operation, and there are multiple join strategies (algorithms) it can use. In this blog post we'll walk through the typical join algorithms and see when they're used.

Overview

We're going to use the demo database. We at Metis provide a Docker container with this database.
Specifically, we'll use just two tables: ticket_flights with ~8 million rows and flights with ~200 thousand rows. We can take a look at the schema:

(Schema diagram of the flights and ticket_flights tables)

Table flights has a primary key configured on the flight_id field, which means there is a B-tree index built on that column. Similarly, ticket_flights has a primary key configured on the tuple (ticket_no, flight_id).

Before moving on, let's also disable parallel workers with the following query:

SET max_parallel_workers_per_gather = 0;

Parallel scans don’t change the algorithms, so we can ignore them in the scope of this article.

Nested Loop Join

The first and simplest join strategy is called Nested Loop Join. It can be depicted with the following pseudocode:

for row1 in table1:
    for row2 in table2:
        if join_condition(row1, row2):
            add_to_result(row1, row2)

We iterate over both tables with two loops, and join them naively. This has quadratic time complexity O(size(table1) * size(table2)). The memory complexity is O(1).
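As a quick sanity check, the pseudocode above translates into a runnable Python sketch. The sample rows and the equi-join predicate below are made up for illustration:

```python
# Nested Loop Join: compare every row of one input with every row of the
# other. O(size(table1) * size(table2)) time, O(1) extra memory.
def nested_loop_join(table1, table2, join_condition):
    result = []
    for row1 in table1:
        for row2 in table2:
            if join_condition(row1, row2):
                result.append((row1, row2))
    return result

ticket_flights = [("T1", 2), ("T2", 3), ("T3", 2)]  # (ticket_no, flight_id)
flights = [(1, "OSL"), (2, "WAW"), (3, "JFK")]      # (flight_id, airport)

# Equi-join on flight_id:
joined = nested_loop_join(ticket_flights, flights,
                          lambda tf, f: tf[1] == f[0])
# -> [(('T1', 2), (2, 'WAW')), (('T2', 3), (3, 'JFK')), (('T3', 2), (2, 'WAW'))]
```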
Let’s now see that in action. Take this query:

EXPLAIN
SELECT *
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id < tf.flight_id

And here is the plan we obtained:

Nested Loop  (cost=0.42..16532324955.82 rows=601044021228 width=95)
  ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=32)
  ->  Index Scan using flights_pkey on flights f  (cost=0.42..1253.81 rows=71622 width=63)
        Index Cond: (flight_id < tf.flight_id)

We can see that the engine decided to use an index to scan the flights table. No index was used to scan the ticket_flights table, which is unfortunate: scanning the whole table means reading every single row, which amounts to plenty of data. When filtering, we generally want to avoid scanning the whole table and read as little data as possible. Next, both of these scans are joined with the Nested Loop algorithm.


Let’s see if changing the order of joins matters. Take this query:

EXPLAIN
SELECT *
FROM flights AS f
JOIN ticket_flights AS tf ON f.flight_id < tf.flight_id

And here is the plan we get:

Nested Loop  (cost=0.42..16532324955.82 rows=601044021228 width=95)
  ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=32)
  ->  Index Scan using flights_pkey on flights f  (cost=0.42..1253.81 rows=71622 width=63)
        Index Cond: (flight_id < tf.flight_id)

The result is the same.

However, we can also see that changing the output aggregation can change the way tables are scanned, though it doesn't change the join algorithm. Let's take this query:

EXPLAIN
SELECT COUNT(*)
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id < tf.flight_id

We get the following plan:

Aggregate  (cost=18034924512.89..18034924512.90 rows=1 width=8)
  ->  Nested Loop  (cost=0.42..16532314459.82 rows=601044021228 width=0)
        ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=4)
        ->  Index Only Scan using flights_pkey on flights f  (cost=0.42..1253.81 rows=71622 width=4)
            Index Cond: (flight_id > tf.flight_id)

We can see that now we scan the flights table with an Index Only Scan. This operation doesn't even need to read the table rows: it can get everything from the index, which makes it even faster than a regular Index Scan. The scans are once again joined with the Nested Loop operation, and finally the Aggregate operation counts the rows.

Hash Join

The next strategy is called Hash Join. The Hash Join algorithm consists of two phases. In the first phase we build a hashtable from one of the tables that we want to join. In the second phase we iterate over the rows of the other table and find the matches in the hashtable. The algorithm looks like this:

for row1 in table1:
    hashtable.add(row1.id, row1)

for row2 in table2:
    row1 = hashtable.get(row2.id)
    if row1 is not null:
        add_to_result(row1, row2)

The complexity is O(size(table1) + size(table2)) if we assume that the hashing algorithm is good and we have O(1) lookup time. The memory complexity is O(size(table1)) so the order matters. The engine generally prefers to hash the smaller table.
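The two phases can be sketched in Python as follows. The sample rows are made up for illustration, and this toy version ignores details a real engine handles, such as spilling the hashtable to disk when it doesn't fit in memory:

```python
# Hash Join: build a hashtable from one input (phase 1), then probe it
# with the rows of the other input (phase 2).
def hash_join(build_table, probe_table, build_key, probe_key):
    # Phase 1: hash the build side (the engine prefers the smaller table).
    hashtable = {}
    for row in build_table:
        hashtable.setdefault(build_key(row), []).append(row)

    # Phase 2: probe with each row of the other table; O(1) expected lookup.
    result = []
    for probe_row in probe_table:
        for build_row in hashtable.get(probe_key(probe_row), []):
            result.append((build_row, probe_row))
    return result

flights = [(1, "OSL"), (2, "WAW")]                  # (flight_id, airport)
ticket_flights = [("T1", 2), ("T2", 1), ("T3", 2)]  # (ticket_no, flight_id)

joined = hash_join(flights, ticket_flights,
                   build_key=lambda f: f[0],
                   probe_key=lambda tf: tf[1])
```

Each table is read exactly once, which is where the linear time complexity comes from.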

Let’s see that in action:

EXPLAIN
SELECT *
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id = tf.flight_id

This is the plan:

Hash Join  (cost=9767.51..302691.07 rows=8391852 width=95)
  Hash Cond: (tf.flight_id = f.flight_id)
  ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=32)
  ->  Hash  (cost=4772.67..4772.67 rows=214867 width=63)
        ->  Seq Scan on flights f  (cost=0.00..4772.67 rows=214867 width=63)

We can see that the engine decided to scan the flights table and then build a hash table out of it. Next, it iterates over the ticket_flights table and matches the rows based on the condition.

If we swap the join order in the SQL query like this:

EXPLAIN
SELECT *
FROM flights AS f 
JOIN ticket_flights AS tf ON f.flight_id = tf.flight_id

then we get exactly the same plan:

Hash Join  (cost=9767.51..302691.07 rows=8391852 width=95)
  Hash Cond: (tf.flight_id = f.flight_id)
  ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=32)
  ->  Hash  (cost=4772.67..4772.67 rows=214867 width=63)
        ->  Seq Scan on flights f  (cost=0.00..4772.67 rows=214867 width=63)

The engine is allowed to do so. SQL queries are declarative, so they define what the result is, but they don’t dictate how the result is calculated.

However, if we add the aggregation:

EXPLAIN
SELECT COUNT(*)
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id = tf.flight_id

Then we get this plan:

Aggregate  (cost=271560.70..271560.71 rows=1 width=8)
  ->  Hash Join  (cost=8298.51..250581.07 rows=8391852 width=0)
        Hash Cond: (tf.flight_id = f.flight_id)
        ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=4)
        ->  Hash  (cost=4772.67..4772.67 rows=214867 width=4)
              ->  Seq Scan on flights f  (cost=0.00..4772.67 rows=214867 width=4)

You can see that the flights table is still scanned as before. There is no index scan this time.

Merge Join

The Merge Join algorithm is used when we can iterate over the rows in order. It works like this:

table1_sorted = table1.sort()
table2_sorted = table2.sort()
row1 = table1_sorted.first()
row2 = table2_sorted.first()

while row1 is not null and row2 is not null:
    if row1.id == row2.id:
        add_to_result(row1, row2)
        row2 = table2_sorted.next()
    elif row1.id < row2.id:
        row1 = table1_sorted.next()
    else:
        row2 = table2_sorted.next()

The time complexity is O(size(table1)*log(size(table1)) + size(table2)*log(size(table2)) + size(table1) + size(table2)). The memory complexity is O(size(table1) + size(table2)).

However, if the data is already ordered, then we get O(size(table1) + size(table2)) for time complexity and O(1) for memory complexity.
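The merge phase can be sketched in Python like this. The sample rows are made up for illustration; this toy version sorts both inputs up front and handles duplicate join keys on the probe side:

```python
# Merge Join: walk both inputs in key order, advancing whichever side
# currently has the smaller key.
def merge_join(table1, table2, key1, key2):
    t1 = sorted(table1, key=key1)  # these sorts are skipped when the input
    t2 = sorted(table2, key=key2)  # is already ordered, e.g. a B-tree scan
    result, i, j = [], 0, 0
    while i < len(t1) and j < len(t2):
        k1, k2 = key1(t1[i]), key2(t2[j])
        if k1 == k2:
            # Emit all right-side rows sharing this key, then advance the left.
            j2 = j
            while j2 < len(t2) and key2(t2[j2]) == k1:
                result.append((t1[i], t2[j2]))
                j2 += 1
            i += 1
        elif k1 < k2:
            i += 1
        else:
            j += 1
    return result

flights = [(1, "OSL"), (2, "WAW"), (3, "JFK")]      # (flight_id, airport)
ticket_flights = [("T1", 2), ("T2", 3), ("T3", 2)]  # (ticket_no, flight_id)

joined = merge_join(flights, ticket_flights,
                    key1=lambda f: f[0],
                    key2=lambda tf: tf[1])
```

After sorting, each input is traversed once, which gives the linear merge phase.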

Let’s see that in action. First, disable the hash join strategy:

SET enable_hashjoin = off;

And then run this query:

EXPLAIN
SELECT *
FROM ticket_flights AS tf           
JOIN flights AS f ON f.flight_id = tf.flight_id

We get the following plan:

Merge Join  (cost=1520511.52..1676140.76 rows=8391852 width=95)
  Merge Cond: (f.flight_id = tf.flight_id)
  ->  Index Scan using flights_pkey on flights f  (cost=0.42..8245.57 rows=214867 width=63)
  ->  Materialize  (cost=1520506.91..1562466.17 rows=8391852 width=32)
        ->  Sort  (cost=1520506.91..1541486.54 rows=8391852 width=32)
              Sort Key: tf.flight_id
              ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=32)

We can see that the engine had to sort the ticket_flights table after scanning it. The flights table, on the other hand, was already sorted, because the B-tree index built for its primary key keeps flight_id in order.
The ticket_flights table needs to be sorted because its primary key consists of the ticket number and the flight id, in that order. Since the order of fields in a composite index matters, the rows are ordered by ticket_no first, so flight_id is not stored in order.
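We can illustrate this last point with a tiny Python example (the ticket numbers and flight ids below are made up): sorting by the composite key (ticket_no, flight_id) orders rows by ticket_no first, so the flight_id column alone ends up unsorted:

```python
# Rows ordered by the composite key (ticket_no, flight_id) are sorted by
# ticket_no first, then flight_id; flight_id alone is not in order.
rows = [
    ("0005432000992", 198),
    ("0005432000991", 30625),
    ("0005432000993", 30625),
]
rows.sort()  # tuple comparison: ticket_no first, then flight_id

flight_ids = [flight_id for _, flight_id in rows]
print(flight_ids)                        # [30625, 198, 30625]
print(flight_ids == sorted(flight_ids))  # False: not ordered by flight_id
```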

Summary

The engine can choose how to calculate the join of two tables. Various algorithms have different time and memory complexities, so it's useful to understand how we can speed things up. We can do that by adding indexes or by making sure we use conditions that allow the engine to pick the more efficient join operations.
