DEV Community


Discussion on: SQL Joins without the Venn diagrams

geraldew profile image

To start with, I generally like the look of this explanation as an article.

I'm going to be picky and cruel in saying this: but I think it's a disservice to not show that even inner joins enact row multiplications when there are multiple instances of the join value.

It's fair enough to not do so in the very first example of an inner join. Alas, having the multiplication of rows then only be shown later when the cross join is taught leaves a dreadful legacy.

I have spent many hours re-teaching analysts - who have sometimes been writing SQL for years - that have not been aware that repeated values on either side of the inner join can cause row expansions in the result. As I am wont to say - if something can happen at all in data, then at some time it will happen.

I think the basic relational fact of row expansions needs to be taught as soon as the inner join has been taught. I find that helps avoid a lot of misunderstandings about left joins too. I'm not saying it's easy, even less so in static print/screen material - I usually have the luxury of doing live peer guidance with a whiteboard, so I don't have a magic solution to proffer. I would suggest going to the bother of doing a second example of the inner join but showing a value in there twice on one side and thrice on the other (because a 2-1 or a 2-2 leaves room for the learner to misunderstand what's happening).

Of course, there will be some environments that new data analysts may encounter where enough tables have primary keys and so row multiplying joins might be rare. But I'd have to say that over my years of experience that this is becoming less common not more. Indeed I've been seeing "big data" users getting advice to not bother with key constraints for performance reasons and newer platforms such as Hadoop having fewer available constraints anyway.

I certainly don't enjoy dealing with multi-megabyte SQL scripts in which there is an unknown spot of row multiplication to be debugged. The most subtly pernicious is the combination of the coders thinking that join values are unique but having done nothing to ensure that they are - and when the data reality is that the data is only very-nearly-always unique.

While it can seem that such problems are really about managing the uniqueness in the tables I do think that a naiveté about inner joins allows problems to spread like an infection in a data processing sequence. I think that such an innocence comes from not being shown inner join row multiplications in the tuition phase.