Unique IDs

#softwareengineering #technicaldesign #mobile

Let's make our IDs truly unique. Easy!

TL;DR Use UUID v4.

How can we generate actually unique IDs for our entities on the client?

On a single device

Let's start this exercise with the simplest solution that comes to mind: the ID is a number that we increment by 1 each time we generate a new one. Easy enough. But there's a technical limit: the number will eventually overflow and either fail or start repeating values. In reality though, if we choose a big enough representation, like a 64-bit integer, most applications will never exhaust the limit. 2^64 = 18,446,744,073,709,551,616 (over 18 quintillion) unique entities could take several lifetimes to create. However, this approach requires that we remember the last generated number, i.e. we need to persist the last value. Can we have unique IDs without persistence?

How about randomly generated numbers? UUID version 4 is considered unique enough by experts. Indeed, the chance of finding a duplicate within 103 trillion version-4 UUIDs is about one in a billion. But notice it is not zero. The same value could be generated again even on the very next attempt; it's just extremely rare. Probably rarer than cosmic radiation affecting your program. This is an issue with any randomly generated number: since they're random, there's always some probability of a collision. Can we do better?
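
To put a number on that claim, here is a rough sketch of the usual birthday-bound approximation. It assumes the 122 random bits of a version-4 UUID are uniformly distributed (the other 6 of the 128 bits are fixed version and variant markers). The function name `collision_probability` is made up for this example.

```python
import math
import uuid

# A version-4 UUID has 128 bits, 6 of which are fixed (version and variant).
RANDOM_BITS = 122


def collision_probability(count: int, bits: int = RANDOM_BITS) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / (2 * 2^bits))."""
    space = 2 ** bits
    # expm1 keeps precision when the exponent is very close to zero.
    return -math.expm1(-count * (count - 1) / (2 * space))


new_id = uuid.uuid4()  # generating one takes no state at all
probability = collision_probability(103 * 10**12)  # roughly one in a billion
```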

We can. We can look to the ever-increasing system clock and follow an approach similar to Snowflake ID. On a single device/installation the machine ID is always the same, so we can drop it. We take the timestamp and add an in-memory counter, which we don't need to persist. It's fine that the counter restarts on app re-launch. We only need to differentiate between IDs generated at the same instant. The timestamp can even be coarse, with a resolution of only seconds or milliseconds. The counter can overflow (i.e. restart) at will, as long as it is large enough not to wrap around within a single instant. How many entities would an app create in an instant anyway? I'd say far fewer than over its entire lifetime. So we could end up with something as simple as: ID = timestamp + sequence number. With one problem. The clock on the client can be modified by the user and thus is not reliable. We can't control the clock.
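
A minimal sketch of the idea, where "+" means packing both parts into one integer rather than adding them. The class name, the 16-bit counter width and the one-second resolution are assumptions picked for illustration, and the sketch is not thread-safe.

```python
import time
from typing import Callable


class TimestampSequenceIdGenerator:
    """ID = timestamp (seconds) in the high bits, in-memory counter in the low bits."""

    SEQUENCE_BITS = 16  # assumed width: up to 65,536 IDs per second before wrapping

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._sequence = 0  # not persisted, restarts at 0 on every launch

    def next_id(self) -> int:
        seconds = int(self._clock())
        sequence = self._sequence
        # The counter is allowed to wrap around; it only separates IDs
        # created within the same second.
        self._sequence = (self._sequence + 1) % (1 << self.SEQUENCE_BITS)
        return (seconds << self.SEQUENCE_BITS) | sequence


generator = TimestampSequenceIdGenerator()
first_id = generator.next_id()
second_id = generator.next_id()
```

With a resolution this coarse, a re-launch within the same second would restart the counter under an unchanged timestamp, so a finer timestamp is the safer choice in practice.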

We could work around it. We can store the last used timestamp and compare it with the timestamp at the next generation. If the clock is not moving forward, i.e. the stored timestamp is equal to or later than the new one, the clock has been tampered with. We can then increment and store a clock change counter and proceed as before, with the ID being: ID = clock change counter + timestamp + sequence number. It has one downside though. We need persistence again to store the last timestamp and the clock change counter. That puts us back at square one, the persisted incrementing number, which is way simpler.
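
Here is one way this could look, as a sketch only. The `store` mapping stands in for real persistent storage, and the class name and keys are invented for the example. One interpretation of mine: within a single run an equal timestamp is treated as normal (the sequence number handles it), and it only counts as a clock change on the first ID after a launch, when the in-memory counter has been lost.

```python
import time
from typing import Callable, MutableMapping, Tuple


class ClockChangeAwareGenerator:
    """ID = (clock change counter, timestamp, sequence number)."""

    def __init__(self, store: MutableMapping[str, int],
                 clock: Callable[[], float] = time.time) -> None:
        self._store = store  # hypothetical stand-in for persistent storage
        self._clock = clock
        self._sequence = 0
        self._first_call = True  # True until the first ID of this run

    def next_id(self) -> Tuple[int, int, int]:
        now = int(self._clock())
        last = self._store.get("last_timestamp")
        changes = self._store.get("clock_changes", 0)

        went_back = last is not None and now < last
        repeated_after_launch = self._first_call and last is not None and now == last
        if went_back or repeated_after_launch:
            changes += 1        # the clock cannot be trusted: start a new "epoch"
            self._sequence = 0
        elif last is None or now > last:
            self._sequence = 0  # new instant, counter starts over
        else:
            self._sequence += 1  # same instant within this run

        self._first_call = False
        self._store["last_timestamp"] = now
        self._store["clock_changes"] = changes
        return (changes, now, self._sequence)


persisted: dict = {}
generator = ClockChangeAwareGenerator(persisted)
entity_id = generator.next_id()
```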

So what's the best approach if we want to avoid persistence? I think the UUID is a good start. Then it's about storage policy. If we accept the fact that collisions may happen, but are very rare, we may consider them so rare that we don't need a recovery strategy. We just need to identify them and fail. Thus, once we create an entity and generate an ID for it, when we try to add it to a collection where identity matters or save it to persistent storage, we should check whether an entity with the same ID already exists and, if so, crash the app, making the user repeat the operation. If the chance of that actually happening is so infinitesimally small that it would affect one or two users once over several years, why even bother to support a technical solution for recovery? Many persistence frameworks (databases) make the task even easier by distinguishing create and update operations. They already check IDs on create. For those that only support upserting, we should check manually in code.
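
For an upsert-only store, the manual check could look roughly like this. `EntityStore` and `DuplicateIdError` are hypothetical names, and a plain dictionary plays the role of the storage.

```python
import uuid
from typing import Any, Dict


class DuplicateIdError(Exception):
    """Raised when a freshly generated ID already exists in the store."""


class EntityStore:
    def __init__(self) -> None:
        self._entities: Dict[uuid.UUID, Any] = {}  # stand-in for real storage

    def create(self, entity_id: uuid.UUID, entity: Any) -> None:
        # Detect the (extremely unlikely) collision and fail loudly
        # instead of silently overwriting existing data.
        if entity_id in self._entities:
            raise DuplicateIdError(f"ID {entity_id} already exists")
        self._entities[entity_id] = entity

    def update(self, entity_id: uuid.UUID, entity: Any) -> None:
        if entity_id not in self._entities:
            raise KeyError(entity_id)
        self._entities[entity_id] = entity


store = EntityStore()
note_id = uuid.uuid4()
store.create(note_id, {"title": "Groceries"})
```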

Across a distributed system

In a regular server-client application that supports multiple clients per user, the situation is more challenging. Imagine we allowed our users to start locally, then they could eventually synchronize their data to a server under an account, connect a different client, e.g. a mobile app on a different device, download data there, share it for a while across devices, then unenroll one device and later enroll it again. How unique is that?

Sadly, IDs generated on one client are more likely to collide with IDs from another client. Even if there's not much data overall and we wanted to take the previous approach of identifying collisions and failing, it wouldn't work well here. The data is already created and persisted on the client. If the server refuses synchronization and there is no strategy for amending IDs, the data would get stuck on the device and wouldn't propagate across the system. No matter how many times the user repeats the process. This calls for a coordination authority.

The server can ensure unique IDs. One way is to make IDs hierarchical, for example ID = installation ID + local entity ID. The client can negotiate a unique installation ID with the server when enrolling (because the device ID is not always available for security reasons). But there are a number of problems, such as supporting a change of installation ID on enrollment, identifying previous installations on re-enrollment to avoid duplicating data, and similar.

There's a better way. The server can generate its own IDs for new data. Hence the globally unique ID in the system could be: ID = server entity ID ?? local entity ID. In practice probably a tuple ID = (server entity ID, local entity ID). Before the data is uploaded to the server for the first time, the local ID should be unique to the given installation. Afterward, the server ID is enough to identify an entity across the system. So far so good on paper.
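
As a sketch, the tuple could be modelled like this. `EntityId` and its `effective` property are names invented for the example, and IDs are plain strings for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntityId:
    local_id: str                    # unique within one installation
    server_id: Optional[str] = None  # assigned by the server on first upload

    @property
    def effective(self) -> str:
        # Equivalent of "server entity ID ?? local entity ID".
        return self.server_id if self.server_id is not None else self.local_id


draft = EntityId(local_id="local-42")
synced = EntityId(local_id="local-42", server_id="srv-1001")
```

Here `draft.effective` yields the local ID, while `synced.effective` yields the server ID.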

Consider the mechanics practically:

The client generates a local unique ID. ✅
The client sends the new entity with the ID to the server. ✅
The server generates a new, globally unique ID to avoid conflicts between clients and stores the entity. ✅
The server communicates the new ID back to the client, but the connection drops. ❌
The client sends the same entity to the server again, only with the local ID. ✅
The server creates a new record, duplicating the entry. ❌

Communication between the server and clients may fail. How can we ensure the IDs are kept consistent?

The installation ID can still help us here. We can treat the combination of installation ID + local entity ID as unique. If we keep a record of the installation ID for each created entity (i.e. the client that uploaded the entity), we can then double-check that an entry for the ID pair doesn't already exist. Then resolve the conflict if needed.
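
One possible shape of that check on the server, sketched with in-memory dictionaries. The `Server` class and its `create` method are hypothetical, and so is the way the conflict gets resolved here: a retry simply returns the server ID assigned the first time.

```python
import uuid
from typing import Any, Dict, Tuple


class Server:
    def __init__(self) -> None:
        self.entities: Dict[str, Any] = {}               # server ID -> entity
        self._by_origin: Dict[Tuple[str, str], str] = {}  # (installation ID, local ID) -> server ID

    def create(self, installation_id: str, local_id: str, entity: Any) -> str:
        origin = (installation_id, local_id)
        existing = self._by_origin.get(origin)
        if existing is not None:
            # The client is retrying an upload whose response got lost:
            # hand back the ID from the first attempt, create nothing new.
            return existing
        server_id = str(uuid.uuid4())
        self.entities[server_id] = entity
        self._by_origin[origin] = server_id
        return server_id


server = Server()
first_reply = server.create("installation-A", "local-1", {"title": "Groceries"})
# The reply never reaches the client, so it sends the same entity again.
retry_reply = server.create("installation-A", "local-1", {"title": "Groceries"})
```

In this sketch both calls return the same server ID and only one record exists, while the same local ID coming from a different installation still gets its own record.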

That's easy, right?