loading...
Cover image for SQL Joins without the Venn diagrams

SQL Joins without the Venn diagrams

helenanders26 profile image Helen Anderson Updated on ・3 min read

Data Analysis (14 Part Series)

1) Speed up your queries with indexes 2) Resources for beginner data analysts 3 ... 12 3) SQL concepts from A to Z 4) 6 ways to search with System Tables 5) Pearl Jam, Nirvana and Beautiful Data Visualisations 6) Entity Relationship Diagrams explained by Sonic the Hedgehog 7) How to use the VLOOKUP Excel function 8) SQL Joins without the Venn diagrams 9) SQL aggregate functions and Mariah Carey 10) Why I use SQL CTEs 11) How to customise SQL Server Management Studio (SSMS) 12) Have you VACUUMed your tables lately? 13) Why you need SQL window functions (part 1) 14) Why you need SQL window functions (part 2)

JOINs are the first big learning curve after getting your head around SQL basics. You may not find everything you need in one table so it's important to know which JOIN clause to use to get the data you need.

While there are plenty of guides to JOINs that use Venn diagrams to get the point across, I prefer something more visual.


INNER JOIN
LEFT JOIN
CROSS JOIN
UNION
UNION ALL


INNER JOIN

Use an INNER JOIN, shortened to 'JOIN', when you want to find the match between two tables. You need to have a column on both tables that you join ON, and that's where the match happens. Any results where there is not a match are discarded.

inner


select
o.order_item, 
i.inventory_item
from orders o
inner join inventory i 
on o.order_item = i.inventory_item


LEFT JOIN

Use a LEFT JOIN when you want to find a match between two tables, but also show a NULL where there is no match from the right table. A RIGHT JOIN does the same but in reverse.

Like the INNER JOIN, you need a column to join ON. Unlike the INNER JOIN, a NULL is used to show there is no match between the two tables.

outer

select
  o.order_item, -- this is the left table
  i.inventory_item -- this is the right table
from 
  orders o
left join inventory i 
on o.order_item = i.inventory_item

CROSS JOIN

A CROSS JOIN joins everything with everything. There is no need to provide a column to join on, and it can result in a very big data set, (and a really big image so you’ll have to use your imagination when reviewing the image below).

Proceed with caution.

cross

select
  o.order_item, 
  i.inventory_item
from 
  orders o
cross join inventory i

UNION

While a JOIN combines columns horizontally a UNION combines rows vertically. This is technically a set operator rather than a JOIN clause but as we are talking about combining datasets this is a good opportunity to introduce it.

Using a UNION combines the result of two queries into one column and removes duplicates.

If your query has multiple columns, they need to be in the same order to complete the UNION.

union

select *
from
  orders
union
select *
from
  inventory

UNION ALL

The UNION ALL combines the results of two queries the same as a UNION but keeps the duplicates in the result.

unionall

select *
from
  orders
union all
select *
from
  inventory

The best way to get your head around JOINs is to start using them. If you aren't working with a SQL database already, check out SQLZoo or Hacker Rank to experiment with JOINs.


Read more


This post first appeared on helenanderson.co.nz

Data Analysis (14 Part Series)

1) Speed up your queries with indexes 2) Resources for beginner data analysts 3 ... 12 3) SQL concepts from A to Z 4) 6 ways to search with System Tables 5) Pearl Jam, Nirvana and Beautiful Data Visualisations 6) Entity Relationship Diagrams explained by Sonic the Hedgehog 7) How to use the VLOOKUP Excel function 8) SQL Joins without the Venn diagrams 9) SQL aggregate functions and Mariah Carey 10) Why I use SQL CTEs 11) How to customise SQL Server Management Studio (SSMS) 12) Have you VACUUMed your tables lately? 13) Why you need SQL window functions (part 1) 14) Why you need SQL window functions (part 2)

Posted on by:

helenanders26 profile

Helen Anderson

@helenanders26

Making applications go faster at Raygun, AWS Data Hero, and tag moderator on Dev.to. Database concept you don’t understand? Let me know, I’ll write a post!

Discussion

markdown guide
 

To start with, I generally like the look of this explanation as an article.

I'm going to be picky and cruel in saying this: but I think it's a disservice to not show that even inner joins enact row multiplications when there are multiple instances of the join value.

It's fair enough to not do so in the very first example of an inner join. Alas, having the multiplication of rows then only be shown later when the cross join is taught leaves a dreadful legacy.

I have spent many hours re-teaching analysts - who have sometimes been writing SQL for years - that have not been aware that repeated values on either side of the inner join can cause row expansions in the result. As I am wont to say - if something can happen at all in data, then at some time it will happen.

I think the basic relational fact of row expansions needs to be taught as soon as the inner join has been taught. I find that helps avoid a lot of misunderstandings about left joins too. I'm not saying it's easy, even less so in static print/screen material - I usually have the luxury of doing live peer guidance with a whiteboard, so I don't have a magic solution to proffer. I would suggest going to the bother of doing a second example of the inner join but showing a value in there twice on one side and thrice on the other (because a 2-1 or a 2-2 leaves room for the learner to misunderstand what's happening).

Of course, there will be some environments that new data analysts may encounter where enough tables have primary keys and so row multiplying joins might be rare. But I'd have to say that over my years of experience that this is becoming less common not more. Indeed I've been seeing "big data" users getting advice to not bother with key constraints for performance reasons and newer platforms such as Hadoop having fewer available constraints anyway.

I certainly don't enjoy dealing with multi-megabyte SQL scripts in which there is an unknown spot of row multiplication to be debugged. The most subtly pernicious is the combination of the coders thinking that join values are unique but having done nothing to ensure that they are - and when the data reality is that the data is only very-nearly-always unique.

While it can seem that such problems are really about managing the uniqueness in the tables I do think that a naiveté about inner joins allows problems to spread like an infection in a data processing sequence. I think that such an innocence comes from not being shown inner join row multiplications in the tuition phase.

 

In 20 years of writing SQL I have never found a legitimate use for RIGHT JOIN in production code. RJ indicates you don't understand your data as it can always be rewritten as a LJ.

One issue that often catches new users is the filter predicate. If you apply it to the Join clause on a LJ you only drop the joined record,; if you add it to the WHERE clause you drop the entire row.

 

Great advice! Thanks for sharing :)

 

The colors make this perfect. So clear and understandable