Recently I found a quite common request on StackOverflow. Generalizing the problem, it can be described as the requirement of insert some data into a table only if that data is not there already.
Many developers will solve it by trying to execute two steps:
- check if the data exists already,
- if not, insert it
This approach has a flaw, whatever the database you are using and no matter the database if relational or not. The problem, in fact, lies in the algorithm itself.
The moment you have two actions, where one depends on another, you need to make sure that the data used by both doesn't change in the meantime because of some other action done by someone else. As you can easily understand, in fact, if you have done the first action, you really need to be sure that in-scope data doesn't change before you can do the second action, otherwise the result may be wrong or inconsistent.
You need a transaction, that will provide the needed isolation so that interference won't happen.
As the chosen algorithm is comprised of two separate steps, you must create a transaction big enough to keep both steps under the same umbrella, so that the two physical separate steps will logically work like one. This behavior is also known as atomicity: the two steps are indivisible. Everything works and changes will be persisted, or nothing works and changes will be undone.
While this will work perfectly from a functional perspective, if you are in a highly concurrent system and you need to have a very scalable solution where an extreme number of executions can be done in parallel, you really want to have such protection around your transaction – the isolation – for the shortest time possible, as for as long as you are using that data in your algorithm, other may have to wait for accessing it, to avoid interfering with your activity.
How to solve this problem elegantly and without having such big transactions? One option is use what is defined as "Optimistic Concurrency". This approach uses a resource version token – for example, an ETag – to make sure that data didn't change between the first and the second step. If data was changed, you simply restart from step 1 and loop continuously until you manage to complete the algorithm or you reach the maximum number of attempts allowed.
Now, that's a very basic approach. It works, but I think we can do better. Much better.
What if, for example, we can do both steps in just one command? No need for big transaction, no need for less-then-optimal loops.
With Azure SQL, doing that is easy: you can
INSERT a row into a table using the result of a
SELECT on that table. Does it start to ring a bell?
By using an
INSERT...SELECT command, we can achieve exactly what is needed. One command, without any explicit transaction. Let's say we have a table, named tags, that stores all the tags associated with a blogs post. A tag can be used in different posts, but only once per post. The table would look like the following:
create table [dbo].[tags] ( [post_id] int not null, [tag] nvarchar(50) not null, constraint pk__tags primary key clustered ([post_id], [tag]) )
Using such table as example, an
INSERT...SELECT to implement the insert-if-not-exists logic would look like:
insert into [dbo].[tags] ([post_id], [tag]) select * from ( values (10, 'tag123') -- sample value ) as s([post_id], [tag]) where not exists ( select * from [dbo].[tags] t with (updlock) where s.[post_id] = t.[post_id] and s.[tag] = t.[tag] )
SELECT will create a virtual table with the data we want to insert. One or more rows can be created with that technique (it works very nicely up to a few hundred rows. If you need more rows then JSON, Table Valued Parameters or Bulk Insert are a better choice). The virtual table will be called
s and will have two columns:
tag. Data types will be automatically inferred; if you want to have some specific data type, you can always
CAST the value to make sure the data type you want will be used. The
UPDLOCK is a hint to tell Azure SQL that we are reading with the goal to update the row. By allowing the engine to know that, the internal mechanism of lock conversion can be optimized to guarantee the best concurrency and consistency.
WHERE clause will make sure only those rows that's doesn't already exists in the target table -
tags - will be returned from the virtual table and passed to the
INSERT statement will do exactly what it says: insert rows into the
tags table, if any.
That may sound a little verbose and quite complicated for such a simple operation, so you'll be happy to know that all that code can be simplified a lot using the
MERGE statement. The logic behind the scenes is the same, but the code is much leaner in my opinion:
merge into [dbo].[tags] with (holdlock) t using (values (10, 'tag1233')) s([post_id], [tag]) on t.[post_id] = s.[post_id] and t.[tag] = s.[tag] when not matched then insert values (s.[post_id], s.[tag]);
MERGE is a very powerful command and would also allow you to implement in the same command also the upsert (insert-or-update) behavior. When executing multiple operations on the same data (after all the
DELETE all together) you may have to be careful if you have triggers or if one row could potentially be affected by multiple actions...but in the simple case of a insert-if-not-exists you shouldn't be worried. If you are interested in learning how much more you can do with
MERGE, read here the detailed doc page.
Challenge solved, with no big transactions involved, much simpler code on the client side, no matter with language you are using.
To be clear, a transaction will still take place, but since everything is done in a single command, the exclusive lock taken on the tag resource will usually be measured in microseconds, assuming your case is like the discussed sample: if you're inserting much more than one row at time, the elapsed time will be different, depending on how many rows you are trying to insert.
If you are worried about this, keep in mind that Azure SQL will allow readers to read from the
tags table even while someone is changing its data as by default it uses the
READ COMMITTED SNAPSHOT isolation level. With that level, the last committed version of the row is kept available and served to anyone asking for read, so that they can still be isolated without being blocked.
Of course, this ability has a cost in terms of resource usage (CPU, Memory and IO, as the previous versions of the row are kept in a version store) and we want to be careful on avoiding wasting resources (just like it happens in the real world!) so having a truly short transaction also helps on this side.
The result of this little optimization is a cleaner code, better performances, less resource usage, more concurrency, and increased efficiency.
You may be interested to know if the one described is the only way to implement an insert-only-if-exist pattern.
I already mentioned the
MERGE, but that there are couple of other ways to solve this matter.
The two presented here, especially the
MERGE, will be more than enough for you if you don't want to get into this only apparently simple topic even more.
In case you want to, instead, I have found a remarkably interesting article the summarizes the most common techniques and does some evaluation on pros and cons of each one here: https://cc.davelozinski.com/sql/fastest-way-to-insert-new-records-where-one-doesnt-already-exist.