Yesterday I was updating some Ruby gems on one of the projects that we have at my current company, and noticed one called active_record_union. I have to admit that I had never heard about SQL's
UNION operation, so that gem generated a lot of curiosity on me. That's why I decided to get more information about it and write this learning!
The purpose of SQL Union is to combine the results of several queries into a single one. For example, let's assume that we have the following table in our DB, representing Users.
Let's assume that we want to search two types of users:
- The users over 23 years old
- The users located on the United States
In order to do that, you'll need the following two queries:
SELECT * FROM Users WHERE Country = "United States"
SELECT * FROM Users WHERE Age > 23
However, this will require two DB calls; usually you want to obtain your result using as few DB calls as possible, and therefore this won't be an optimal solution. Besides, using different queries requires a manual processing of the results to combine them into one (as each query returns a result set)
This is where
UNION comes to play. Using
UNION you can both two queries into a single one, and therefore obtain the same results in a single DB operation.
SELECT * FROM Users WHERE Country = "United States" UNION SELECT * FROM Users WHERE Age > 23
By doing this, we will just need one DB operation, and also we will get all the matching users in the same result without needing to do any manual combination.
Probably you asked yourself that question when you saw the previous section. And you're right: you could also combine both queries into one by using several conditions linked with
OR, like this:
SELECT * FROM Users WHERE (Country = "United States") OR (Age > 23)
However, the key is in the performance of each operation. If you have indexes in the columns you're using to do the search, you will probably be better using
OR will make the SQL operation to use indexes just for one of the columns, or even not using them at all. On the other hand, using
UNION resolves both queries separately, enabling the use of the most optimal query plan for each query, which probably means using the index of each of the columns if they have one. You can find a more detailed explanation about it in the following links:
UNION enables the obtention of results of several tables, instead of limiting to a simple one. For example, let's assume that now we have two tables: one for customers and one for suppliers:
If we wanted to obtain all the customers or suppliers in USA, we could do something like the following:
SELECT 'Customer' AS Type, CustomerID AS ID FROM Customers WHERE Country = "United States" UNION SELECT 'Supplier', SupplierID FROM Suppliers WHERE Country = "United States"
And we would obtain something like this:
Notice how we didn't even need to specify the column names in the second query! As the columns are in the same position,
UNION automatically knows which column name has to be used.
(It's important to consider that this will only be possible if both operations returned the same number of columns, and if its types were similar).
Something that I didn't mention is that, when combining the results of the queries,
UNION does an extra operation to remove repeated rows. Going back to the first example, this would have been useful if instead of searching for those users older than 23, we would have wanted to search those older than 20: the user 1 matches both condition, and therefore he would have been present in both results. But to prevent us to have repeated data,
UNION would have filtered the results and returned his row only once.
However, let's assume that we're completely sure that we won't have repetitions. For example, if we had a table like the following one:
This time, let's assume that we want to retrieve:
- The users whose Country is United States
- The users whose Continent is Asia
We are absolutely certain that there will never be a coincidence between those two groups. Therefore, letting
UNION search for repetitions is a waste of time.
For those cases, we can use the
UNION ALL operation: it also combines several queries, but without checking if there are repetitions in the result. Therefore, the following query:
SELECT * FROM Users WHERE Country = "United States" UNION ALL SELECT * FROM Users WHERE Continent = "Asia"
Will return as a result:
And it will actually be a faster than using a simple
UNION (how much faster will depend on the size of your result).
However, it also requires you to be more careful about your operations,because if you combine two queries that can have repeated results...
SELECT * FROM Users WHERE Country = "Spain" UNION ALL SELECT * FROM Users WHERE Continent = "Europe"
You will, in fact, end with repeated results.
UNION statement is a really useful resource if you need to search by several conditions and want to take as much profit as possible from the indexes associated to the corresponding columns. It is also really useful if you want to combine results from two tables that have a similar structure (for example, if you're using Concrete Table Inheritance and only need the common data).
However, it also has it's downsides: you need to be aware that the repetition removal step may slow down the query a little bit if you're retrieving a lot of rows; but the good part is that you can use
UNION ALL instead to skip that part. Of course, then you need to be completely sure that you won't have any repeated data among the different queries. As Uncle Ben would say: "With great power, comes great responsibility"