How SQL JOIN Algorithms Work
Ever wondered what's really happening when you execute a SQL JOIN? While it might seem straightforward on the surface, the underlying mechanisms are far more intricate and fascinating. In this article, we'll pull back the curtain and explore the three primary methods SQL employs to perform JOINs, each with its own strengths and ideal use cases.
To set the stage, let's define Na
and Nb
as the number of records in tables A and B respectively. Now, let's embark on our journey through the world of SQL JOIN algorithms.
1. Merge Join
- Sorts both tables by the join key (if not already sorted)
- Compares records, moving pointers along both tables
The overall time complexity is O(Na log Na) + O(Nb log Nb) + O(Na + Nb)
This is the method I wish for everyone when calling EXPLAIN
. It seems that the algorithm complexity is high, but if our values are arranged in ascending order (for example, serial ID), then there's no need for sorting, and we get a complexity of O(Na + Nb)
!
2. Hash Join
- Creates a hash table for one of the tables
- Goes through the second table, looking for matches in the hash table
The overall time complexity is O(Na) + O(Nb)
. But we require a lot of additional memory to store the hash map.
It's used by the DB planner if the table is large and unsorted.
3. Nested Loop Join
In the case of Nested Loop Join, the database compares each record from table A with each record from table B. This leads to an overall time complexity of O(Na * Nb)
- which is quadratic! How could this be allowed? But let's think about why this method exists and is sometimes even more beneficial than the others.
And here's a whole rabbit hole:
- Low memory in the DB (yes, this happens too), then Hash Join becomes too expensive in terms of memory, and Merge Join in terms of time
- When one or both tables involved in the join have a relatively small number of rows, Nested Loop Join can be efficient, as iterating through all rows doesn't require significant resources (no need to create additional entities, which also takes time, as even switching registers in RAM takes time)
BUT HOW DOES IT ALL WORK??
The answer is simple - statistics.
When choosing a method, the database relies on the statistics table from the information schema, but that's not all. If during execution the real cost turns out to be erroneous, the method can be changed, but that's an idea for the next post.
Top comments (0)