This month's PGSQL Phriday #015 discusses using UUID in PostgreSQL. I have previously written on this topic (UUID or cached sequences?), and I won't tell you which one to use: I believe the decision should be based on your business requirements, application design, and access patterns. However, PGSQL Phriday #015 introduces the topic as a fight between the database community and developers, which I think is the wrong approach and may lead to bad SQL application design.
A database community should prioritize its users, who are primarily developers, instead of opposing them. As someone who has worked with databases in operations and development teams, I love understanding the internals of SQL databases as much as discussing the best application design with the developers. The ultimate objective is to provide the best database service to the application and deliver value to the users. After years of consulting, I'm now a Developer Advocate for YugabyteDB, a PostgreSQL-compatible distributed SQL database. I consider anyone who builds applications on PostgreSQL, whether running official distributions or connecting to compatible managed services, part of the PostgreSQL community.
On this topic of UUIDs, we need to listen to the developers. PostgreSQL provides many data types, and the right choice is not a matter of general opinion: it depends on how the application uses the data.
As a SQL fan, if I want to generate an identity, I use a SEQUENCE, which generates unique numbers in a scalable way (when cached):
postgres=# create table my_table (
id bigint generated always as identity (cache 100)
, message text
);
CREATE TABLE
postgres=# \ds my_table_id_seq
List of relations
Schema | Name | Type | Owner
--------+-----------------+----------+----------
public | my_table_id_seq | sequence | postgres
(1 row)
postgres=# insert into my_table(message)
values ('Hello World')
returning id
;
id
----
1
(1 row)
INSERT 0 1
postgres=# select * from my_table where id=1;
id | message
----+-------------
1 | Hello World
(1 row)
By using SQL's declarative generated always as identity, all rows get an identifier as soon as they are created. This identifier is obtained immediately when the row is created, making it available for later use. The database engine has plenty of optimizations to make sequences scalable: PostgreSQL caches sequence numbers per connection, YugabyteDB can also cache them in the tablet server, and Amazon Aurora optimizes them in the shared buffers. This looks good, but I've run this from psql, a DBA tool, and not from an application.
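To see why caching makes sequences scalable, here is a minimal Python sketch of the per-connection caching idea behind cache 100. It is an illustration of the principle, not PostgreSQL's implementation: the SequenceServer and CachedSequence classes are hypothetical stand-ins for the database-side sequence and the connection-local cache.

```python
class SequenceServer:
    """Stands in for the database-side sequence; each batch fetch is one round trip."""
    def __init__(self):
        self.last = 0
        self.round_trips = 0

    def next_batch(self, cache_size):
        self.round_trips += 1
        first = self.last + 1
        self.last += cache_size
        return first, self.last


class CachedSequence:
    """Per-connection cache, like CACHE 100: fetch a range once, serve values locally."""
    def __init__(self, server, cache_size=100):
        self.server = server
        self.cache_size = cache_size
        self.next_value = 1
        self.ceiling = 0  # empty cache: next_value > ceiling forces a fetch

    def nextval(self):
        if self.next_value > self.ceiling:
            self.next_value, self.ceiling = self.server.next_batch(self.cache_size)
        value = self.next_value
        self.next_value += 1
        return value


server = SequenceServer()
conn = CachedSequence(server, cache_size=100)
ids = [conn.nextval() for _ in range(150)]
# 150 unique identifiers were handed out with only 2 round trips to the "server"
```

The trade-off is the same as with PostgreSQL's cache option: fewer round trips, but each connection consumes its range independently, so identifiers are unique and roughly increasing rather than gap-free.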
In real-life applications, the lifecycle of a business object, like my "Hello World" message here, starts before its row is inserted into the database.
Modern applications do not persist their data by sending SQL as text strings the way a DBA does with psql. The application instantiates data objects, with data and business logic, and synchronizes them through an ORM, which provides a type-safe, object-oriented API to interact with the database. The object's lifecycle starts in the application, before it reaches the database. It has a transient object identifier, its address in memory, but needs a primary key before it is saved to the database.
Any persistent object must be identified from the beginning of its lifecycle, which is why the primary key is not nullable. You can no longer rely on generated always as identity, which generates an identifier only once the row reaches the database. What are your options?
You can use a sequence and call nextval() when creating the object in the application (@GeneratedValue(strategy=GenerationType.SEQUENCE) in JPA), build the object, and then save it to the database. This works but is not ideal, as it involves a round trip to the database for this single purpose.
Now you understand the value of UUID: it can generate a unique identifier in the application, without a round trip to the database, which makes sense for an object that starts its existence before the creation of the corresponding row in the database. I hope you see that listening to the developer is necessary to provide relevant database recommendations.
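This pattern is easy to sketch. Here is a minimal Python illustration, with a hypothetical Message class: the identifier is assigned by the application, at object construction time, with no database involved.

```python
import uuid


class Message:
    """A hypothetical application object: its identity exists
    from the moment the object is constructed, before any database call."""
    def __init__(self, text):
        self.id = uuid.uuid4()   # generated in the application, no round trip
        self.text = text


m = Message("Hello World")
# The primary key is already known and can be used (logged, cached, referenced
# by other objects) before the row is ever sent to the database with something
# like: INSERT INTO my_table(id, message) VALUES (:id, :text)
```

This is what an ORM does with a UUID generation strategy: the key is part of the object's state from birth, not a value returned by the database at save time.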
Generating a unique identifier in the application, or at least before the commit, is also necessary for high availability. If a database failure occurs during a commit, it's impossible to know whether the transaction was committed or not. Ask your DBA what to do in the following situation, where you don't know if your insert was committed and, if it was, you don't know its identifier:
postgres=# \set AUTOCOMMIT on
postgres=# CREATE TABLE demo (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
message text
);
CREATE TABLE
postgres=# INSERT INTO demo(message) VALUES ('Hello World')
RETURNING id;
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
If a unique identifier is generated before the insert operation, the insert can be retried without the risk of creating duplicate records, because the primary key enforces uniqueness. An INSERT ... ON CONFLICT DO NOTHING statement can be used to make the operation idempotent.
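The retry logic can be sketched in a few lines of Python. This is a simulation, not real database code: an in-memory dict stands in for the table, and insert_do_nothing mimics the semantics of INSERT ... ON CONFLICT (id) DO NOTHING, where a duplicate key is simply a no-op.

```python
import uuid

table = {}  # stands in for the table; the dict key plays the role of the primary key


def insert_do_nothing(row_id, message):
    """Mimics INSERT ... ON CONFLICT (id) DO NOTHING: a duplicate key is a no-op."""
    if row_id not in table:
        table[row_id] = message


# The identifier is generated BEFORE the first attempt, so it survives a failure
row_id = uuid.uuid4()

insert_do_nothing(row_id, "Hello World")   # first attempt: outcome unknown to the client
insert_do_nothing(row_id, "Hello World")   # blind retry after a connection loss
insert_do_nothing(row_id, "Hello World")   # safe to retry any number of times
```

Because the client already holds the key, a retry after the connection loss shown above either inserts the row (the commit never happened) or does nothing (it did), and in both cases ends with exactly one row.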
Let's discuss the choice of UUID generation algorithm by considering the input from developers. The decision criteria will depend on the business requirements. For instance, if the identifier is exposed to the user, it should not carry any technical meaning that can be guessed by a malicious user, and a random UUID is the best solution for that.
Some performance considerations benefit from a discussion between developers and database experts. Inserting random UUIDs into a PostgreSQL B-Tree may cause issues, because the writes are scattered across the whole index, while this is not a problem for an LSM-Tree, as in YugabyteDB. If rows inserted at the same time need to be queried together, a UUID prefixed with a time component can be used (UUID v7). An additional prefix can assign a small bucket number to distribute the data ingest without scattering it too much (customized in a UUID v8).
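To make the time-prefix idea concrete, here is a minimal Python sketch of a UUIDv7-style value following the RFC 9562 layout: a 48-bit Unix-epoch millisecond timestamp in the top bits, then the version and variant bits, then random bits. It is an illustration, not a production generator (in particular, it makes no monotonicity guarantee within the same millisecond).

```python
import os
import time
import uuid


def uuid7_like(ms=None):
    """Sketch of a UUIDv7-style value: 48-bit unix-epoch milliseconds,
    then version/variant bits, then random bits (layout per RFC 9562)."""
    if ms is None:
        ms = time.time_ns() // 1_000_000
    b = bytearray(ms.to_bytes(6, "big") + os.urandom(10))
    b[6] = (b[6] & 0x0F) | 0x70   # version 7 in the high nibble of byte 6
    b[8] = (b[8] & 0x3F) | 0x80   # RFC variant bits
    return uuid.UUID(bytes=bytes(b))


# Values generated later sort after values generated earlier,
# because the timestamp occupies the most significant bits:
a = uuid7_like(ms=1_700_000_000_000)
b = uuid7_like(ms=1_700_000_000_001)
assert a < b
```

The v8 bucketing idea mentioned above is the same construction with a small application-chosen bucket number placed in front of (or instead of part of) the timestamp, trading some clustering for a distributed ingest.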
It's important to note that one should not apply a single rule when choosing data types. For instance, an application may use both UUIDs and sequence numbers as identifiers. Many tables have both a natural key, which is exposed to the user, and a surrogate key used internally for referential integrity. The natural key may be generated as a random UUID, while the surrogate key may come from a sequence. Also keep in mind that certain database limitations, such as PostgreSQL declarative partitioning or Citus sharding, may limit the possibilities of enforcing uniqueness through global indexes. While a UUID provides a high probability of uniqueness, a sequence strictly guarantees it, so if you cannot create all the unique constraints you need, a sequence may be the safer choice.
One important database design recommendation is to use the UUID datatype, with its binary representation, rather than TEXT, which is larger; the same advice applies to numbers and dates. Keep in mind that UUID is not different from other datatypes: choose the one that best suits your business requirements. Additionally, for performance and scalability, you can generate it with a prefix that distributes or clusters the values according to your data access patterns.
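The size difference between the two representations is easy to check with Python's uuid module: the binary value is 16 bytes, while its canonical text form is 36 characters (32 hex digits plus 4 hyphens), before any per-value storage overhead the database adds.

```python
import uuid

u = uuid.uuid4()
print(len(u.bytes))   # 16 bytes in the native binary representation
print(len(str(u)))    # 36 characters in the canonical text form
```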
Top comments (6)
Thanks for sharing it, Franck. Excellent article 👏🏻
I'm not a database specialist like you, but as an application developer, I try to take the best out of the databases I work with as much as possible. I've been studying this theme since I wrote an article about it (in pt_BR 🇧🇷). So, I could sum up the "UUID as a PK" issue with RDBMS as follows:
- The more random the key is, the worse it is for database performance, especially with MySQL and MSSQL Server, which use clustered indexes by default (that's why UUIDv4 might be terrible);
- If your dataset fits into memory (buffer cache and OS cache), I would say it does not matter whether you're using UUIDv4 or a sequential key;
- I think this issue will be in the past as soon as UUIDv7 becomes a standard.
Well, although I still favor and recommend sequential keys (int/bigint) over UUIDs as PKs, I also understand it will not be a big issue for the majority of microservices out there, since their data and workload are already partitioned 😊
One other consideration that goes all the way to the frontend is Relay and GraphQL. In particular, Relay expects a globally unique identifier graphql.org/learn/global-object-id... . Of course, generally the object type (table) is automatically appended to the ID, so there's no need to specifically choose UUID. But supposedly you can use both in the same table (an autoincrementing sequence plus another column with a UUID and a non-clustered index). That is probably the best of both worlds.
Even more interesting is using UUIDv7 (draft RFC) generated in the application. The best thing is that it is incremental, unlike the UUIDv4 you typically use. UUIDv7 is compatible with v4 in that Postgres can store it as the uuid type, and you can sort by it easily.
As an ignorant end user, my takeaway is: that's awesome. I can ask our vendor to integrate it and make it available in our PostgreSQL-based system, as in our line of work we need UUIDs in virtually every business correspondence with government agencies.
Yes, it makes sense to have a UUID when it is visible to the user, and I suppose it must be random (not disclosing the insert time). Just be aware that a B-Tree index on random UUIDs will be larger and will see random writes, so it is important that it fits in the memory cache.