Frederik Van Lierde

Posted on Dec 9, 2023

Why you should never use an UUID as the primary key in SQL Databases

#sql

Using a UUID (Universally Unique Identifier) as a primary key in SQL databases has both advantages and disadvantages. While UUIDs offer benefits in certain scenarios, there are reasons why they might not be the best choice for a primary key:

Indexing and Performance:
UUIDs are 128 bits long, compared to 32 bits for a typical integer. This larger size can result in increased storage requirements and decreased performance, especially when dealing with large datasets.

Indexes built on UUID columns may not perform as efficiently as those on smaller data types. This is because larger keys can lead to more page reads, impacting query performance.

Readability and Debugging:
Unlike integers, UUIDs are not human-readable, which can make debugging and manual inspection of the database more challenging. Integers or other smaller data types may be more convenient for developers and database administrators.

Clustering:
UUIDs are designed to be globally unique, but they are not guaranteed to be sequential. This lack of sequential ordering can result in suboptimal disk I/O patterns, affecting the performance of certain types of queries, especially those involving range-based searches.

Fragmentation
UUIDs are often generated using a combination of timestamp and random values. This randomness can lead to higher levels of index fragmentation, impacting database performance over time.

Storage Overhead:
Storing UUIDs can lead to increased storage requirements, both in terms of disk space and memory. This can be a concern in environments where storage costs are a critical factor.

Application Complexity:
Managing UUIDs, especially their generation and uniqueness across distributed systems, can add complexity to the application logic. This complexity may not be necessary if simpler primary key types suffice for the application's requirements.

Top comments (65)

vbilopav • Dec 9 '23 • Edited

And how would you ensure pk uniqueness on a same entity in a distributed system?

Because you can't do it with int autoincrement.

And I don't see how declaring a primary key field as uuid default gen_random_uuid() instead of int autoincrement is any different in terms of extra complexity.

UUID exists for a reason and people use it for a reason. That doesn't mean it's good fit for every scenario. Samo goes for int autoincrement.

Paweł bbkr Pabian • Dec 10 '23 • Edited

You can get unique autoincremented IDs in distribited systems. For example in MySQL you can configure autoincrement increment and offset. Using those first server will generate 1, 101, 201, 301, 401, etc (increment 100 + offset 1). Second server will generate 2, 102, 202, 302, 402, etc (increment 100 + offset 2). Very simple and effective. And you can migrate those rows between servers without risk of collision.

mkoehrer • Dec 11 '23

This will not work with anonymus replicates (for example notebooks running a local database - my old company did this with MS Server Replication -

The only thing that will work in this scenario are uuids

mjaggard • Dec 10 '23

That is definitely not simple. You need to know (at least approximately) how many servers you'll have, you also presumably need to either number them somehow and then you need to get your next increment value before creating the object in some cases to be truly atomic from the application user's point of view.

Paweł bbkr Pabian • Dec 10 '23

Bigint unsigned with increment 1000 solves this issue entirely. Be realistic.

Yes, numbering servers requires discipline and misconfiguration can be fatal.

As for ability to generate ID before actual insert and never hit reordering penalty - same trick can be applied. Just use sequence generator.

I think this is great topic for more advanced post. OP barely scratched the surface, because in almost all relational databases there must be PK / UK for data internal ordering and row replication. Using UUIDs has nasty consequences in large scale. I'll write my own post soon, clearing few misconceptions around PK generators.

csgeek • Dec 10 '23

Why wouldn't you do what basically every distributed database does? Hash the pkey and split that across the n number of nodes available.

Your approach won't scale as soon as the are not keys used then you owned for.

Paweł bbkr Pabian • Dec 10 '23

Because databases like MySQL or PostgreSQL organize data by PK internally, so using UUID or any other non incremental sequence results in huuuuuuuge performance drop on inserts.

Michael Kochell • Dec 10 '23

Pretty clickbaity. "Why you should never" is a bit extreme. At least explain a horror story that has scarred you personally or something that caused you to write this. I use UUIDs at work and it has worked out fine for me.

Rubén Chiquin • Dec 10 '23 • Edited

There's a spanish chanel who is well spoken on this subject that made a video arguing the contrary.

youtu.be/wR5b0OhbUyw?si=JeJmicrHey...

It's main argument is the fact that id's never travel from the backend to the frontend.

An argument against not using them is security. Regular id's are ordered. If you were to have an endpoint like users/:Id, everyone could know every user id just by counting (user/1, user/2, user/3...). It could be solved with some Salt, but then you get more complexity..

They also provide multiple ways to aliviate some of the pains that are generated by using UUID.

Deserves a watch! Anyway good post

RadekHavelka • Dec 9 '23

And lastly, altho the chance of collision / two uuids being the same is very small, it still exists ...

Artur Mostowski • Dec 10 '23

Yeah, but you should worry more about spontaneous combustion

karlatt • Dec 20 '23

Do you know they say you have enough unique UUID to number all the atoms of the known universe ? saying the risk of collision is existing, is like saying you can win the lottery 10 times in a row , as the probability exists to , it should be a number like 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001 chances on two billions
In fact, ANY probability can find a singularity where it can happen: I'll be happy if I win the lottery just once, and all the poor morons like me who spend money in a ticket would be happy too ... But the fact is that there are a lot more people who lose than people who win on each lottery session. So the probability you loose ten times in a row is much much bigger than my first assertion .
If you want to say something that gives structured information , you have to provide all the variables.

So how much probability do you have to have a collision against the probability of having a unique UUID ? This would be interesting and just show that the chance of collision is so small that it would be one of the biggest miracles of all times if it happens once

RadekHavelka • Dec 21 '23

the problem isnt that the chance is LOW, the problem is that the chance IS. There is no guarantee that this will not happen if you generate two uuids, thats the point, but it is less significant than what reasons were mentioned in the article, from usability and logic point of view UUIDs makes no sense.

Joachim Lous • Dec 11 '23

Can be viewed as a transient error, since a retry will fix it. But do make sure id generation is part of the retry.

immōtus • Dec 11 '23

Frederik, what methodology did you use to draw these conclusions?

Without first describing the scientific method you used, all these are just unsubstantiated hypothesis (except for human readability and storage issues) and nothing else, really.

Adaptive Shield Matrix • Dec 11 '23 • Edited

Currently one of the most popular uuid generators
github.com/paralleldrive/cuid2

Performance loss of uuid vs integer -> is/was about 6% in real world systems in one article I read (measure your own data) -> so nearly completely negletible

Readability and Debugging: -> if you have 7-9 character long integers -> readability is same as an uuid

Clustering -> adressed by cuid2
Fragmentation -> issue is negletable, since only a minor perf loss

Application Complexity -> completely wrong
uuid massively decrease Client-Server complexity, because the id can be generated by the client/browser without having to wait for and parse the server response. This reduction in architectural complexity is more worth than any perf loss suffered by the db.

Richard Crawshaw • Dec 10 '23

I've worked with a couple of SQL Server databases where GUIDs had been used for clustered primary keys. The performance in those areas truly sucked: a couple of orders of magnitude worse than where ints were used.

If a GUID was deemed necessary as an ID my first choice for a design would be to split that from the clustered key. Put the GUID in a non-clustered UNIQUE index and the int /bigint PK as an auto-incremented clustered index. Depending on the other data requirements I might split that from the rest of the data, putting the real data in one or more "extension" tables using the same PK and a 1:1 relationship.

Zohar Peled • Dec 10 '23

I wouldn't say never, there are situations where it's actually better to use a UUID over simple 32 bit integers, however as with any other tool, you should know what you're doing if you want to get good results.

Ivan Zakutnii • Dec 10 '23

100% agreed.

However, we are using UUID as the PKeys, but I suppose it is suitable for microservices, at least for the time being.

Someday, it might come back to haunt us... I am almost sure it will :D

Neil • Dec 11 '23

As long as these are comb/sequential ID's you will have no issues even in indexing but if they are not and are not sequential then you may. I have experience first hand in this.

Ivan Zakutnii • Dec 14 '23

Sorry, I don't get it here.

FJones • Dec 10 '23

I would add: Locking behaviour on UUIDs can be surprising. When using UUIDs as a PK (or even unique key), partial locking can result in ranges being locked that are completely unrelated, due to the effective randomness of UUIDs. Deleting an entry from a long time ago can still result in a lock that may impact a fresh insert, for example.

Joachim Lous • Dec 11 '23

On the upside, accessing a (usually) recent one doesn't lock al the other recent ones, where the most work is probably ongoing.

View full discussion (65 comments)