rhymes

Posted on Sep 7, 2019 • Edited on Jan 23, 2020

What would you use as a sortable, globally unique, ID?

#distributedsystems #discuss

The objective is to have an ID that has the following two inherent properties:

globally unique: if two different nodes generate an ID, the possibilities for them to conflict are practically zero (hence no autoincrementing IDs)
sortable: the IDs, even those generated by two different nodes, can be sorted or partially sorted making it far easier to order events (hence no UUIDs)

What would you use?

Oldest comments (39)

Bertil Muth • Sep 7 '19

Why not use two fields, a UUID and a timestamp?

rhymes • Sep 7 '19

I like the lateral thinking going on in your question :)

That's the status quo though, which works decently but it still requires two fields.

What if you could look at two IDs and roughly know which one comes first? What if the IDs carried with them both info? The timing and the uniqueness...

Bertil Muth • Sep 7 '19 • Edited

Hehe. Then what about the brute force approach, concatenation of timestamp and UUID, using an intermediate character guaranteed not to occur in one of them?

Edit: lexical and chronological sorting of ISO8601 date times is identical. So if you start with that, then the UUID, does that solve the problem?

rhymes • Sep 7 '19

Then what about the brute force approach, concatenation of timestamp and UUID, using an intermediate character guaranteed not to occur in one of them?

Something like this could work :)

You're really close to the solutions I found searching on the internet, take a look at github.com/segmentio/ksuid for example

Dian Fay • Sep 7 '19

You would have to perform more complex comparisons to see if ids match, since you lose the binary representation. v1 and v1mc UUIDs are sort of ordered but not necessarily between multiple nodes, and they cycle fairly rapidly.

Jimmy Nilsson did an interesting experiment back in the day with adding the timestamp inside the UUID itself.

rhymes • Sep 7 '19

First: it's a little bit funny to read an almost 20 years old article that talks about RAM in order of megabytes :D

I love how hacky his solution was, though it's a little bit weird that the timestamp is at the end, but that's just the order of the two casts.

This part was funny as well:

A DATETIME has the "precision" of 1/300th of a second, and it's no problem for SQL Server to generate five GUIDs in 3ms with this algorithm

5 IDs in 3 ms. Oh, the year 2000 :D

Thanks for linking this, it was such a cool read!

As an aside I wonder how slower the uuid type in PostgreSQL is in comparison to integer primary keys, today.

Bertil Muth • Sep 7 '19

That’s actually what I meant: concatenation of a timestamp with a UUID to generate a new UUID. Or maybe I just don’t understand your comment :-)

Dian Fay • Sep 7 '19

When you said "intermediate character" I took it to mean a string 1567876931|11aca736-c9a9-4266-a2f6-13bd07202c09 or some such.

Bertil Muth • Sep 7 '19

That’s what I meant. I wasn’t aware of the restrictions placed on UUIDds until now that I looked it up. I just thought it meant “universally unique”, which would be the case with my proposal, I think. So thanks. One more thing I learned.

Matteo Joliveau • Sep 7 '19

May I ask what the need for sorting is?
I personally don't see many reasons to sort by id, most often what people really want is sort by creation date.

Anyway, back in topic. This a tricky question because globally unique and predictable are two properties in direct clashing with each other. A solution is taking a look at Microsoft SQL Server Sequential ID, a sortable GUID.
There a couple of disadvantages:

They are not decentralized. If they were you would lose the anti-clashing capabilities of GUIDs.
They are predictable. Don't ever expose them to the public as they can lead to resource enumeration.

Basically, you may as well use BIGINT as your ids if you need sorting.

If you need to sort events on a distributed system, better use some other properties. Like timestamps.

rhymes • Sep 7 '19 • Edited

May I ask what the need for sorting is?

Sure, efficiency and convenience mostly. Let's say we use UUIDs, they work mostly well until these IDs land in a place far from your system. Someone decides to store events on S3 using the UUID as a file, suddenly you have gigabytes of events that can't sort well unless you peek inside the file to find the timestamp.

Or you grep an event log and suddenly you have to come up with a combination of bash commands in a pipe to extract both the ID and the timestamp to sort them.

Or you want to create shards out of them to group related data in different machines, UUIDs are useless for that.

In my experience UUIDs are great, until they aren't :D

The great thing about globally unique and sortable IDs is that they carry information with them. If well designed I can even deconstruct them to extract such info.

This a tricky question because globally unique and predictable are two properties in direct clashing with each other.

Not exactly true, see the examples at the end of my comment :)

A solution is taking a look at Microsoft SQL Server Sequential ID, a sortable GUID.

This definitely wouldn't work as you said, they are basically sequential as the name says.

If you need to sort events on a distributed system, better use some other properties. Like timestamps.

But timestamps are not globally unique and can be duplicate.

I'll leave you with two examples of partially sortable IDs that are also random and unique:

K-Sortable Globally Unique IDs, or ksuid, is a type of ID that has an encoded 32bit timestamp with 1 second resolution prefix and a 128bit random part as a suffix. The explanation of how Segment got there is great: A brief history of the UUID
Universally Unique Lexicographically Sortable Identifier, or ulid, this ID uses 48 bit with millisecond resolution for timestamps and 80 bits of randomness.

For example, Firebase uses something like this for their IDs: The 2 ^ 120 Ways to Ensure Unique Identifiers

Johannes Lichtenberger • Sep 7 '19

Something like the Twitter Snowflake generator:

developer.twitter.com/en/docs/basi...

Johannes Lichtenberger • Sep 7 '19

See also this:

callicoder.com/distributed-unique-...

rhymes • Sep 7 '19 • Edited

Something like the Twitter Snowflake generator

The problem with the Snowflake generator is in its definition. It's a server that nodes have to sync up to, which would require a node that needs to be monitored, load balanced and so on, only to generate IDs. Also, if one day we want the clients to be able to generate their own IDs to send to the server the Snowflake would be an architectural limitation.

I think if works for Twitter but for example in the article you linked they mention this problem.

I'm not comfortable about the implementation in the article because it uses the MAC address which is one of the issues with the earlier versions of UUIDs but maybe something like Firebase does is better: firebase.googleblog.com/2015/02/th... - moving away from using an hardware identifier like the MAC even if they are not perfect. I think ULIDs are better github.com/ulid/javascript

Johannes Lichtenberger • Sep 7 '19

I thought Snowflake wouldn't require a central server, why would it? Haven't read the stuff now but a few years ago.

As mentioned in the second article it uses a timestamp + nodeID + sequenceID. The sequence-ID might be the same, but the nodeIDs could be anything which identifies the server uniquely I guess. Okay, finding something which is unique with only using qafew bytes, I don't know...

rhymes • Sep 7 '19

I thought Snowflake wouldn't require a central server, why would it? Haven't read the stuff now but a few years ago.

I don't know about the latest incarnation, but the one that was public years ago was a server generated with Apache Thrift and used ZooKeeper for coordination. Not exactly practical unless you're Twitter...

Yaser Al-Najjar • Sep 7 '19 • Edited

I love simplistic solutions, here is my 0.2 cents:

Use a prefix for each node like n1 / n2

Generate an incremental ID: 1, 2, 3...

So you will end up with IDs like:

n1-1
n1-2
n2-1
n2-2

Easy to sort and never gonna conflict.

Does that help with your use case?

rhymes • Sep 7 '19

Not exactly, what if the node is a client and for any reason they set the same name? What if the node changes name or is replaced by another one?

Aside from the sortable part, the problem of conflict is already solved by using UUIDs, going back to incremental IDs is not feasible in this case.

Unfortunately this, as many things in distributed systems, is not that simple.

Yaser Al-Najjar • Sep 7 '19

I get it.

You wanna have a long-term solution, however, you still didn't tell us your exact use case so we are throwing bunch of out-of-context solutions.

BTW, this might be a very interesting solution (using both UUID and incremental ID): tomharrisonjr.com/uuid-or-guid-as-...

rhymes • Sep 7 '19 • Edited

You wanna have a long-term solution

I think this is one of those few cases when there's no cost into having a correct long term solution instead of having to change the format of IDs (and thus having multiple sets of them) in time.

you still didn't tell us your exact use case so we are throwing bunch of out-of-context solutions.

I didn't write the final goal explicitly but I wrote the requirements. The final goal is to traceable unique IDs for events in a distributed system that also are sortable which is a very handy property. There's no mistery to it :D

BTW, this might be a very interesting solution (using both UUID and incremental ID): tomharrisonjr.com/uuid-or-guid-as-...

I agree that exposing IDs is not great, not even UUIDs in theory. In that case is best to have multiple keys.

It's an interesting approach but it's tied to a database. I want something that would work at each level (the programming language, the DB table, an event log and so).

Thanks for sending me suggestions though, it's really great to see all of these approaches!!

William Gottschalk • Sep 7 '19 • Edited

Have you tried ULIDs before? They’re lexicographically sortable ids unique to the ms. honeybadger.io/blog/uuids-and-ulids/

rhymes • Sep 7 '19 • Edited

Before writing this I searched and read a bit of stuff on the internet, so I ultimately encountered the ULID spec and loved it: the bit when they say: "Won't run out of space 'til the year 10889 AD" made me laugh.

I didn't mention it because I wanted to see if there were other solutions and to start a fresh discussion without "bias".

I also didn't read this blog post before, so thank you!

My only worry about adopting ULIDs is mentioned in the article:

According to the internet, some ULID implementations aren't bulletproof

so it needs a little bit of research before that.

Thanks William!

stereobooster • Sep 7 '19

Yes. Unique, sortable, more compact Base32 encoding

rhymes • Sep 7 '19

That's definitely an option!

The only alternative to ULID I found is github.com/segmentio/ksuid which has a smaller set of bits for timestamp.

Let me know how it goes? Have you checked the implementation in the language you need?

Ste Griffiths • Sep 7 '19

Instagram use a custom scheme (instagram-engineering.com/sharding...) as they need time-sortable unique IDs across multiple DB servers (shards). They implemented it in PL/SQL, composing an ID from the current time, the shard ID, and an auto-increment, giving 1024 IDs per shard per millisecond. Pretty cool, check out the link.

rhymes • Sep 7 '19

It's pretty cool indeed! Thank you! In our case I think it should be abstracted from the DB but that's just an implementation detail

fpim • Sep 8 '19

I think simply Unix timestamp + {node_id_with_pedding_zeros} will do

rhymes • Sep 8 '19

Hi fpim, thanks for your input!

What if you don't have control of the node?
What if two parallel processes generate an ID at the same exact time?

You can understand why this is risky.

Version 1 of UUID did exactly that: used a timestamp and the MAC address of the machine. The problem was that MAC addresses can be spoofed (it means I can use another machine's MAC address on mine) and those UUIDs can also be traced to the machine creating them.

For these reasons it was deprecated.

Dustin King • Sep 8 '19 • Edited

There are probably better answers, but a concatenation of:

timestamp (yyyy-mm-ddThh:mm:ssZ)
sequence number (of item generated by same node at same timestamp, left padded with zeroes so the numbers are sortable asciibettically)
node ID (also zero-padded if numeric)

Maybe include milliseconds or more depending on how fast and frequently items are generated. Also maybe node ID before sequence number depending on whether you'd rather have ones with the same node ID together, or have the same sequence number together for a given timestamp.

Something like:

2019-09-08T22:17:41Z;00003;grayarea

Edit to add: Maybe use _ to separate the fields to make it more distinct:

2019-09-08T22:17:41Z_00003_grayarea

rhymes • Sep 9 '19

Thanks Dustin, the idea of using a timestamp is pretty popular and I agree.

The problem with using a sequence generator though is predictability and also collision, which to be avoided would require coordination which is not feasible. That's why you shouldn't use things like custom random functions for secrets :D

The node ID (whatever we want to use) is also another predictable part. We don't want your node to generate IDs posing as my node, for example.

It's a really tricky thing to come up with something that's random, secure and portable.

UUIDv1 had the timestamp and the node ID (the MAC address) but it wasn't secure because MAC addresses could be guessed and spoofed.

All the implementations I saw around have a completely random part attached to the encoded timestamp, which I guess is the best way to avoid those issues.

Some have suggested ULID directly or variations on the same tune (encoded timestamp plus completely random string) for those reasons.

Dustin King • Sep 9 '19

I guess I was assuming the nodes were in a relatively trusted environment. Like, maybe distributed geographically but communicating on a secure network, where nodes are owned by the same entity. Also being able to rely on some authorities for unique node IDs (and time synchronization).

If spoofing is an issue, one could require that each ID is cryptographically signed using the node's private key. However, appending a signature might make the ID a lot longer, I'm not sure.

rhymes • Sep 10 '19

I guess I was assuming the nodes were in a relatively trusted environment.
Also being able to rely on some authorities for unique node IDs (and time synchronization).

This requires synchronization though, which is another problem in itself. It means having to handle a sort of "god machine" that releases node IDs everytime a node comes online, which still is tricky for serverless environments, unless you are considering every single process of the app a separate node. Keeping in mind that the authority could be offline or unreachable or too slow and yet another machine to handle, monitor, keep updated and secure and so on...

Unless tracing back the originating node from the ID is paramount (which could have security issues onto itself in case an ID leaks outside the trusted area), I believe letting go of the whole idea of embedding a node identifier in the final ID is a way to sidestep all of these things

Dustin King • Sep 10 '19

I think I would need to know what this is for to go further. The simplest thing satisfying your original criteria would probably be a timestamp down to the nanosecond + random per-item hex string to avoid collisions.

If node identity needs to be kept secret, then the source of random numbers could be a potential point of deanonymization, so a CSPRNG might need to be used.

Timestamps have some potential issues, but depending on the application, they may or may not matter.

rhymes • Sep 11 '19

The simplest thing satisfying your original criteria would probably be a timestamp down to the nanosecond + random per-item hex string to avoid collisions.

This is what we ended up choosing, though it's down to the millisecond. We're using ULID as the format. It's a hash of a timestamp and a random string