SQL 101: All about SQL JOINs

Helen Anderson on November 17, 2018

Joining tables is the first big learning curve after getting your head around SQL basics. More often than not you won't find everything you need... [Read Full]
 
 

My wild guess would be that’s because FULL JOIN is an evidence of a design flaw. If you find yourself doing FULL JOIN your database structure requires in an immediate surgery.

 

That's not at all true. They're very rare in my experience, but even GOTO has its place and there's been much less harm wrought by full joins.

For an example, a data model I'm currently working with involves a 1:1 relationship between stations and sites: a station can be deployed to a site, a site may be empty (no station), or a station may be held in reserve (no site). I want to see all my stations and all my sites, with deployed stations correctly lined up with their sites. A full outer join between stations and sites on stations.site_id produces exactly the desired output. Anything else would be clumsy at best, and the only remotely appropriate alternative model simply moves the foreign key to the other side.

Well, I am not a SQL architect by any mean, but a join table sites_to_stations would not only allow to avoid FULL JOIN but also attach some attributes to links (like linked_at or like) in the future for free.

I already get metadata for free (it's event sourced), so a junction table even with the appropriate constraints to enforce 1:1 vs m:m would just take up space and make other queries more complicated.

Got it. Thanks for the explanation, I was wrong. In this case it indeed sounds as the best solution.

Also, the junction table approach still doesn't give me what I want, since the un-junctioned records collide on the join criteria:

[local] dian#dian= select * from left_table;
 id   val  
────┼───────
  1  one
  2  two
  3  three
(3 rows)

Time: 0.559 ms
[local] dian#dian= select * from right_table;
 id  val 
────┼─────
  1  1
  2  2
  3  3
(3 rows)

Time: 0.587 ms
[local] dian#dian= select * from left_right;
 left_id  right_id 
─────────┼──────────
       1         3
       3         1
(2 rows)

Time: 0.533 ms
[local] dian#dian= select * from left_table l
left outer join left_right lr on lr.left_id = l.id
right outer join right_table r on r.id = lr.right_id;
   id     val    left_id  right_id  id  val 
────────┼────────┼─────────┼──────────┼────┼─────
      1  one           1         3   3  3
      3  three         3         1   1  1
 (null)  (null)   (null)    (null)   2  2
(3 rows)

Time: 0.984 ms

To get unaffiliated lefts and rights to show up, you have to run the query twice and union the results:

[local] dian#dian= select * from left_table l
dian#dian- left outer join left_right lr on lr.left_id = l.id
dian#dian- left outer join right_table r on r.id = lr.right_id
dian#dian- union
dian#dian- select * from left_table l
dian#dian- left outer join left_right lr on lr.left_id = l.id
dian#dian- right outer join right_table r on r.id = lr.right_id;
   id     val    left_id  right_id    id     val   
────────┼────────┼─────────┼──────────┼────────┼────────
      3  three         3         1       1  1
 (null)  (null)   (null)    (null)       2  2
      2  two      (null)    (null)  (null)  (null)
      1  one           1         3       3  3
(4 rows)

Time: 0.697 ms

Indeed. This is a perfect example of how blinkered are we (I :) might be with our background, experience and expertise.

Say, 20 times in my life I saw FULL JOIN in the legacy DBs, that I always could have refactored to the joined tables. I never met the example of a kind you just shared (btw, thank you again and again for taking time doing that, I really appreciate and value it.)

And—voilà—I stupidly made the wrong assumption based on my experience, expertise and all that crap.

Very enlightening.

You're welcome! :) I should say that the case I mentioned with sites and stations is, to my memory, one of maybe two or three times I've actually written a full join in application code over more than a decade of designing databases and writing SQL. It's really rare, but an important tool nonetheless.

 

I have a use case for FULL JOIN.

FULL JOINs are fantastic when, for one reason or another (perhaps you built the database just as a way to combine some external data sources ad-hoc), your database has acquired data in two different tables that includes different fact tables about the same real-world entities.

See my example in my recent blog post "Every SQL Join You'll Ever Need".

You'll see that it's easy to make a table that looks like this:

name_lf company name_fl fav_food
Amjit, Anush Apple
Borges, Benita Boiron
Combs, Cathy CVS Cathy Combs Carrots
Daher, Darweesh Dell Darweesh Daher Doritos
Ellis, Ezra EDF Ezra Ellis Endives
Fulvia, Frances Firestone
Frances Fulvia Fries
Grace Gao Garlic
Helen Hopper Hummus

Out of a table that looks like this:

ssn name_lf ph em company
A1A1A1 Amjit, Anush 1111111 111@example.com Apple
B2B2B2 Borges, Benita 2222222 222@example.com Boiron
C3C3C3 Combs, Cathy 3333333 333@example.com CVS
D4D4D4 Daher, Darweesh 4444444 444@example.com Dell
E5E5E5 Ellis, Ezra 5555555 555@example.com EDF
F6F6F6 Fulvia, Frances 6666666 666@example.com Firestone

And a table that looks like this:

social name_fl phone email fav_food age
C3C3C3 Cathy Combs 3333333 ccc@example.com Carrots 33
D4D4D4 Darweesh Daher 4444444 444@example.com Doritos 44
E5E5E5 Ezra Ellis 5555555 555@example.com Endives 55
FFF666 Frances Fulvia 6666666 666@example.com Fries 66
G7G7G7 Grace Gao 7777777 777@example.com Garlic 77
H8H8H8 Helen Hopper 8888888 888@example.com Hummus 88

It may not be a terribly common business need in a well-normalized long-term data store (or, as you say, one that doesn't "require immediate surgery"), but it's a really common one in ETL / "set up the latest database management wants" work. :-)

I could certainly also see myself using a FULL JOIN in the "stations and sites" types of "what sort of data I'm dealing with" situations Dian described. Other people have a point that often times when a "business need" requires two such tables to be joined so often, one would think they might have attributes of their relationships that need to be stored and therefore need a junction table, so that's worth thinking about as one designs their database! But as Dian points out ... not necessarily.

I'd give that it definitely does seem to be far more common, business-question-wise, to want to see the intersection of two data sets or to see "definitely thing A, but only things B/C/D/etc. if applicable", than it is to want to see "whatever -- but put it together on the same line if it seems related"! :-) But I wouldn't call having a business need to look for such joins a "flaw" by any stretch of the imagination.

 

Aleksei …

FULL JOIN will do basically a LEFT JOIN and a RIGHT JOIN, correct?

Yes, you are correct in that I have never found too many practical uses as … no databases would normally have this situation in production use. But I still think it should be in the list.

There ARE limited cases like linking all reps to all products and finding which reps have NOT sold what products.

Dale

The beauty of development ... there is never a right and wrong! Some things are logical until the day that they aint ... and vice-versa.

Dale

 

It's not always right. Sometime I using this. Common in big complex query with joining subquery

I do it when trying to debug legacy databases. I don't do it in production code though.

 

Hey, nice write up! I think another topic worth mentioning for beginners is a light (but IMO necessary) introduction to the cost of SQL queries. A JOIN is typically is an operation that may take days or just a minute depending on multiple factors — I think that knowing some of those factors (like having indices) would be empowering to beginner devs. We can also argue that this is a more advanced topic. :-)

 

Thanks!

You're absolutely right, it's tricky to keep beginner posts light while covering a lot of those technical factors, like trying to keep efficient and running smoothly.

My next post on the list is on indexes. I'm finding it tricky to keep it at an intro level and more conversational but will hopefully get there in the next few days.

 

Yep, working with indexes is a very important topic. They can downright make or break the performance of a database! I think it's a good idea to rely on a lot of diagrams, and perhaps animated GIFs explaining how the rows are inserted into the index.

Also I find the library metaphor very useful to build a mental model:
Say you're in a library and you're looking for a book. Without any hints, you'll have to search through all the books to find it. But if books are put on shelves in alphabetical order or by genre, that's already much easier! So, put that information in a book at the entrance and voilà — you've got an index!

I'm tossing up as to whether I explain how a B-tree works or just stick to a high-level explanation. That will dictate the strategy I take with gifs, images and which direction I take it.

I also enjoy the library metaphor:

"Databases are like libraries.

Tables are like books stored in a library.

Rows are like pages of a book and to make getting to the page you need quicker, you need to create an index. Flipping through a textbook page by page looking for that one page you need is going to take time, the same way querying millions of rows in an unindexed table is going be time-consuming and tedious."

I think a high-level B-tree explanation would be interesting — but perhaps you could warn that this needs some CS fundamentals (trees)?

Your approach to the library metaphor is interesting — I actually thought of it one level up, i.e. "a table is like a library" because I was thinking about a book as a row in a table. For example, a book has many properties (like genre, publish date, content, title, authors…) that we can relate to columns and relationships. Also, I'd be cautious not to cause confusion because "page" also have a specific meaning in the realm of databases (grouping of rows stored on disk). :-)

Anyway, excited to read this next article!

This is why I’ve held off tackling indexes so far. I’d like to pitch it to beginners but feel like I’m going to get too far into CS topics I wasn’t intending to explain.

Back to the draft :)

 

Great post, Helen! Next time I'm logged into my website, I'll throw it into a "suggested reading" section of my "Every SQL Join..." post.

I stuck to "same people, different tables" joins because in Salesforce-land, most of what beginners do is beat their heads against the wall trying to combine customer data from different sources into a single table-shaped output, and I think a fair amount of my audience may come from Salesforce-land. (Also, I was having trouble thinking of meaningful examples that would be good for all my join types when sticking to proper PK-FK relationships.)

But I wanted to get around to also showing how joins could be used to look up more traditional PK-FK-connected details. Maybe now I don't have to ... I can just link to your examples. :-D

 

Thank you, that's great feedback. :)

Do you find SOQL to be very different to T-SQL over in Salesforce land? I'd love to read more about how they differ and any tips on making it easier to switch gears.

 

Yup, hugely different! You don’t get to DO joins of your own choosing at all. There’s a limited set of sub selects, “parent table lookups,” and semi-joins you can do between tables that already have PK-FK relationships, but there’s no “build whatever you want.” The closest thing, in-platform, is using their Java-like language to hand-join (iterate yourself imperatively) table dumps in memory. The seminal explanation of what you CAN do in pure SOQL is Jeff Douglas’s “SOQL, how I query thee, let me count the ways” (blog post)

Wow, sounds like a tough sell for a SQL/Excel analyst to get their head around. From what I understand it's what you use in Wave to do more than the basics.

That sounds more like SAQL, I think, which I haven’t played with because Wave (now “Einstein Analytics?”) costs extra and we don’t have it. But it DOES have a query language that lets you join at your leisure, I believe I recall from presentations on it. Looked way more like SQL, I believe. SOQL comes free as part of Salesforce for basic queries against the tables you create. SOQL is mostly custom-join-free, I suppose, both for simplicity and to keep non-DB-type amateurs from executing long-running cross-join queries on a server shared with other companies.

Hmmm I’m going to have to investigate the SOQL/SAQL a bit more. It sounds like I’ll be supporting analysts using one of them. It would be good to know which one!

 

Thanks for the nice intro! I would recommend having a look at sql-joins.leopard.in.ua/ for visualizing JOINS ☺️

 

Helen …

Thanks for the quick examples (here and elsewhere).

I think you should expand on what LEFT Vs RIGHT JOIN mean (ie. the table specified first is the 'left' table).

For CROSS JOIN I tend to think of it as an "EVERYTHING JOIN" and it can generate a MASSIVE amount of data and is generally bad. If you CROSS JOIN two tables with 1,000 rows each you get 1,000,000 rows.

There are practical uses for CROSS JOIN but they are few and far between. For example:

"I will generate a list of all reps linked to all products and you can then cross out the ones you don't want" has been quite common sort of thing.

I was surprised to see UNION discussed here but … yep it is a join or sorts. However, I think it is a whole new topic all on it's own.

Further topics? Sub-query? Co-related Subquery - and why is can be SOOO bad and should be avoided at all costs with anything other than small datasets!

Dale

 

Hi Dale

Thanks for all your comments, I appreciate the feedback.

You’re right, I could have given more of an explanation as to what I meant by LEFT and RIGHT. The image doesn’t speak for itself so it could have been clearer.

The Cross join is more of a warning than anything else. As you say, things can get out of hand very quickly.

I included Unions at a very high level to show that you can ‘join’ tables both horizontally and vertically but agree it is a blog post on its own.

Thank you for suggestions too, I’ll be posting a few more data blog posts before the year is out!

Helen

 
 
 
 
code of conduct - report abuse