Working with big data requires attention to details, testing, retesting, comparing data returned by the stored procedures, views, table valued functions etc. with the expected result. So understanding the core concepts of SQL is a must.
I am currently working on a project for an insurance company and the amount of data they have is huge. One of their main functionality is having reports where they can interpret the data as they need.
While testing something on one of their reports, I found a bug in the stored procedure that fed the report with data. They were trying to JOIN two sets of data, the left hand side contained NULL on a particular column, and the right side contained 0 on the same particular column.
While you may think that representing NULL on the right hand side was the fix to go to, this wouldn't have worked. Instead, this was the point where I thought representing NULLs with 0s would work. I'll demonstrate below why NULL wasn't working and how representing it with 0 fixed this.
Let's consider the following example:
LEFT hand side:
VALUES (1, null), (2, 1), (3, 2)) leftHandSide(Id, fk)
RIGHT hand side:
VALUES (null, '0'), (1, '1'), (2, '2')) rightHandSide(Id, fk)
I'm performing a JOIN between them:
SELECT * FROM
(VALUES (1, null), (2, 1), (3, 2)) leftHandSide(Id, fk)
INNER JOIN
(VALUES (null, '0'), (1, '1'), (2, '2')) rightHandSide(Id, fk)
ON rightHandSide.Id = leftHandSide.fk
And I want my result set to look like this, it won't work.
Id fk Id fk
1 NULL NULL NULL
2 1 1 1
3 2 2 2
Instead I will only get the last 2 rows
Id fk Id fk
2 1 1 1
3 2 2 2
When you have NULL values in a column, these will never be matched to other NULL values. This is because NULL signifies the absence of any value, and cannot be compared as they will never equal anything.
So, what I did was to represent NULL to 0 so that my result set contains the 3 rows:
Id fk Id fk
1 0 0 0
2 1 1 1
3 2 2 2
by simply doing
SELECT * FROM
(VALUES (1, ISNULL(NULL, 0)), (2, 1), (3, 2)) leftHandSide(Id, fk)
INNER JOIN
(VALUES (ISNULL(NULL, 0), '0'), (1, '1'), (2, '2')) rightHandSide(Id, fk)
ON rightHandSide.Id = leftHandSide.fk
Lesson learned: whenever you have to deal with NULL, and you don't understand why your result set is missing some rows that you expect, think about representing NULLs to 0s.
Top comments (2)
I'd be careful just replacing NULL to 0, because the actual data could be 0 and you would be entering false data. For example, air temperature or other meteorological data.
Suppose there was a thermometer failure and some temperatures were not registered. If you simply put 0 to these you are introducing false data that can lead to wrong forecasting.
It's common to replace missing meteorological data with -999 or -9999, because they aren't possible values.
So, instead replacing NULL to 0, I'd replace it for an impossible value in your context.
Hey, thank you for the comment. For this particular case you explained would indeed be better to have an impossible value. For my case though, 0 could never have been a possible one, reason why I replaced NULL to 0. It's really depending on the context. :) Also, thank you for reading my first article.