Using a UUID (Universally Unique Identifier) as a primary key in SQL databases has both advantages and disadvantages. While UUIDs offer benefits in certain scenarios, there are reasons why they might not be the best choice for a primary key:
Indexing and Performance:
UUIDs are 128 bits long, compared to 32 bits for a typical integer. This larger size can result in increased storage requirements and decreased performance, especially when dealing with large datasets.
Indexes built on UUID columns may not perform as efficiently as those on smaller data types. This is because larger keys can lead to more page reads, impacting query performance.
Readability and Debugging:
Unlike integers, UUIDs are not human-readable, which can make debugging and manual inspection of the database more challenging. Integers or other smaller data types may be more convenient for developers and database administrators.
Clustering:
UUIDs are designed to be globally unique, but they are not guaranteed to be sequential. This lack of sequential ordering can result in suboptimal disk I/O patterns, affecting the performance of certain types of queries, especially those involving range-based searches.
Fragmentation
UUIDs are often generated using a combination of timestamp and random values. This randomness can lead to higher levels of index fragmentation, impacting database performance over time.
Storage Overhead:
Storing UUIDs can lead to increased storage requirements, both in terms of disk space and memory. This can be a concern in environments where storage costs are a critical factor.
Application Complexity:
Managing UUIDs, especially their generation and uniqueness across distributed systems, can add complexity to the application logic. This complexity may not be necessary if simpler primary key types suffice for the application's requirements.
Top comments (66)
And how would you ensure pk uniqueness on a same entity in a distributed system?
Because you can't do it with int autoincrement.
And I don't see how declaring a primary key field as uuid default gen_random_uuid() instead of int autoincrement is any different in terms of extra complexity.
UUID exists for a reason and people use it for a reason. That doesn't mean it's good fit for every scenario. Samo goes for int autoincrement.
You can get unique autoincremented IDs in distribited systems. For example in MySQL you can configure autoincrenent increment and offset. Using those first server will generate 1, 101, 201, 301, 401, etc (increment 100 + offset 1). Second server will generate 2, 102, 202, 302, 402, etc (increment 100 + offset 2). Very simple and effective. And you can migrate those rows between servers without risk of collision.
This will not work with anonymus replicates (for example notebooks running a local database - my old company did this with MS Server Replication -
The only thing that will work in this scenario are uuids
That is definitely not simple. You need to know (at least approximately) how many servers you'll have, you also presumably need to either number them somehow and then you need to get your next increment value before creating the object in some cases to be truly atomic from the application user's point of view.
Bigint unsigned with increment 1000 solves this issue entirely. Be realistic.
Yes, numbering servers requires discipline and misconfiguration can be fatal.
As for ability to generate ID before actual insert and never hit reordering penalty - same trick can be applied. Just use sequence generator.
I think this is great topic for more advanced post. OP barely scratched the surface, because in almost all relational databases there must be PK / UK for data internal ordering and row replication. Using UUIDs has nasty consequences in large scale. I'll write my own post soon, clearing few misconceptions around PK generators.
Why wouldn't you do what basically every distributed database does? Hash the pkey and split that across the n number of nodes available.
Your approach won't scale as soon as the are not keys used then you owned for.
Because databases like MySQL or PostgreSQL organize data by PK internally, so using UUID or any other non incremental sequence results in huuuuuuuge performance drop on inserts.
Pretty clickbaity. "Why you should never" is a bit extreme. At least explain a horror story that has scarred you personally or something that caused you to write this. I use UUIDs at work and it has worked out fine for me.
There's a spanish chanel who is well spoken on this subject that made a video arguing the contrary.
youtu.be/wR5b0OhbUyw?si=JeJmicrHey...
It's main argument is the fact that id's never travel from the backend to the frontend.
An argument against not using them is security. Regular id's are ordered. If you were to have an endpoint like users/:Id, everyone could know every user id just by counting (user/1, user/2, user/3...). It could be solved with some Salt, but then you get more complexity..
They also provide multiple ways to aliviate some of the pains that are generated by using UUID.
Deserves a watch! Anyway good post
And lastly, altho the chance of collision / two uuids being the same is very small, it still exists ...
Yeah, but you should worry more about spontaneous combustion
Do you know they say you have enough unique UUID to number all the atoms of the known universe ? saying the risk of collision is existing, is like saying you can win the lottery 10 times in a row , as the probability exists to , it should be a number like 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001 chances on two billions
In fact, ANY probability can find a singularity where it can happen: I'll be happy if I win the lottery just once, and all the poor morons like me who spend money in a ticket would be happy too ... But the fact is that there are a lot more people who lose than people who win on each lottery session. So the probability you loose ten times in a row is much much bigger than my first assertion .
If you want to say something that gives structured information , you have to provide all the variables.
So how much probability do you have to have a collision against the probability of having a unique UUID ? This would be interesting and just show that the chance of collision is so small that it would be one of the biggest miracles of all times if it happens once
the problem isnt that the chance is LOW, the problem is that the chance IS. There is no guarantee that this will not happen if you generate two uuids, thats the point, but it is less significant than what reasons were mentioned in the article, from usability and logic point of view UUIDs makes no sense.
Can be viewed as a transient error, since a retry will fix it. But do make sure id generation is part of the retry.
Frederik, what methodology did you use to draw these conclusions?
Without first describing the scientific method you used, all these are just unsubstantiated hypothesis (except for human readability and storage issues) and nothing else, really.
I've worked with a couple of SQL Server databases where GUIDs had been used for clustered primary keys. The performance in those areas truly sucked: a couple of orders of magnitude worse than where ints were used.
If a GUID was deemed necessary as an ID my first choice for a design would be to split that from the clustered key. Put the GUID in a non-clustered UNIQUE index and the int /bigint PK as an auto-incremented clustered index. Depending on the other data requirements I might split that from the rest of the data, putting the real data in one or more "extension" tables using the same PK and a 1:1 relationship.
I wouldn't say never, there are situations where it's actually better to use a UUID over simple 32 bit integers, however as with any other tool, you should know what you're doing if you want to get good results.
100% agreed.
However, we are using UUID as the PKeys, but I suppose it is suitable for microservices, at least for the time being.
Someday, it might come back to haunt us... I am almost sure it will :D
As long as these are comb/sequential ID's you will have no issues even in indexing but if they are not and are not sequential then you may. I have experience first hand in this.
Sorry, I don't get it here.
You know what, you should write about solutions to this, instead of straight ratioing uuid. If a beginner reads this they will, yeah autoincrement is a good idea, but its not, cuz lets say if id is exposed in endpoints, yeah your DX(Developer Experience) is improved, but bad guys sitting behind you, their DX is also far better than you. They will simply sneak in and type "Hey, give me the /products/123, then /product/122 like that. So, autoincrement is a terrible idea. According to you, UUIDs are bad (but not for me). Politely asking suggest some solutions, if UUIDs and autoincrement are bad.
Thanks.
I never expose internal keys .
Why would you do that ?
Also I didn't say UUId are bad. I am using them all the time.
I just don't use then for primary keys and foreign keys
This reminded me of this video:
Currently one of the most popular uuid generators
github.com/paralleldrive/cuid2
Performance loss of uuid vs integer -> is/was about 6% in real world systems in one article I read (measure your own data) -> so nearly completely negletible
Readability and Debugging: -> if you have 7-9 character long integers -> readability is same as an uuid
Clustering -> adressed by cuid2
Fragmentation -> issue is negletable, since only a minor perf loss
Application Complexity -> completely wrong
uuid massively decrease Client-Server complexity, because the id can be generated by the client/browser without having to wait for and parse the server response. This reduction in architectural complexity is more worth than any perf loss suffered by the db.
I would add: Locking behaviour on UUIDs can be surprising. When using UUIDs as a PK (or even unique key), partial locking can result in ranges being locked that are completely unrelated, due to the effective randomness of UUIDs. Deleting an entry from a long time ago can still result in a lock that may impact a fresh insert, for example.
On the upside, accessing a (usually) recent one doesn't lock al the other recent ones, where the most work is probably ongoing.
I guess we can use UUID concept on mixed basis, Like for user Identification or payment id we can use uuid as it will be unique and hard to crack, where as for smaller concepts like product ID, customer ID, etc, we can use normal Smaller datatype
Insist on credible evidence to substantiate your claims, as your current discussion lacks empirical support, additionally, various statements seem baseless and lack credibility.
If you manage a substantial database and harbor concerns about UUIDs, contemplate replacing your UUID primary key with a Snowflake ID, which proves more effective than the conventional integer auto-increment.
en.m.wikipedia.org/wiki/Snowflake_ID
Note: avoid using chatgpt in future articles.
I've been using snowflake IDs exclusively the last 2 years
I used to think this when i was a junior developer. But this article is WRONG and should instead read ( why you should not use a non sequential or non-comb type UUID ). If you are going to use UUID/GUID's for Id fields then my recommendation is to use sequential or comb guids. We use these in our large scale SaaS solutions and they are much better in many ways than integers. They are performant ( comparable to integers for indexing ) and can be generated within the app layer or db layer and can be generated on distributed systems without clashes. Anyone reading this article should first read up on the above points because restricting yourself based on this article will really hinder your ability to build large-scale applications. integers are fine for small scale e.g a few million record tables but when you start to move to large scale and more importantly in browser record creation such as ticketing systems etc you WILL have to use UUIDs or some kind of string. At this point, Comb GUIDs become the go-to solution. @frederik_vl I urge you to amend your original article as it is factually incorrect and missleading.
Further more to provide additional context to my above comment. There are some things people should be aware of regarding GUIDs that not everybody is.
There are many articles that compare the overall performance of int's vs ( newsequentialId as it is in SQL Server or Comb GUID's) and i am not going to waste time linking through to them. What i would refer people to is the following repo that covers some of this and provides a way to generate sequential guids across different application layers ( COMB ).
github.com/richardtallent/RT.Comb