DEV Community

loading...
Cover image for Generating unique IDs in a Large scale Distributed environment

Generating unique IDs in a Large scale Distributed environment

Apoorv Tyagi
I write about algorithmic solutions for software engineering problems, interesting programming tips, tricks & resources. May also include machine learning, NLP & Blockchain
Originally published at apoorvtyagi.tech ใƒป6 min read

Introduction

Recently at work, We were looking for a way to generate unique IDs across a distributed system that could also be used as the primary keys in the MySQL tables.

We knew in a single MySQL database we can simply use an auto-increment ID as the primary key, But this wonโ€™t work in a sharded MySQL database.

So I looked at various existing solutions for this and finally learned about Twitter Snowflake - a simple 64-bit unique ID generator.

Why don't you use UUID? ๐Ÿค”

UUIDs are 128-bit hexadecimal numbers that are globally unique. The chances of the same UUID getting generated twice are negligible.

The problem with UUIDs is that they are very big in size and donโ€™t index well. When your dataset increases, the index size increases as well and the query performance degrades.

Another problem with UUIDs is related to the user experience. Eventually, our users will be needed that unique identifiers. Imagine that a customer calls Customer Service and is asked to provide the identifier. Having to spell a complete UUID is not a pleasant experience.

Twitter snowflake โ„

Twitter snowflake is a dedicated service for generating 64-bit unique identifiers used in distributed computing for objects within Twitter such as Tweets, Direct Messages, Lists, etc.

These IDs are unique 64-bit unsigned integers, which are based on time. The full IDs are made up of the following components:

  • Epoch timestamp in millisecond - 41 bits (gives us 69 years with respect to any custom epoch)
  • Configured machine/node/shard Id - 10 bits (gives us up to total of 210 i.e 1024 Ids)
  • Sequence number - 12 bits (A local counter per machine that sets to zero after every 4096 values)
  • The extra 1 reserved bit at the beginning which is set as 0 to make the overall number as positive.

image.png

Since these use the timestamp as the first component, therefore, they are time sortable as well. Another benefit is its High Availability.

By default, 64-bit unsigned integers (long) will generate an Id whose length is 19, but sometimes it may be too long, our use case needed an Id whose length should not be greater than 10.

This article will share a simplified version of the unique ID generator that will work for any use-case of generating unique IDs in a distributed environment based on the concepts outlined in the Twitter snowflake service.

Coding time โŒš

In our case, the full ID will be composed of a 20-bit timestamp, 5-bit worker number, and 6-bit sequence number.

The remaining 1-bit is the signed bit and it is always set to 0 to make the final value positive.

Our microservices can use this Random number generator to generate IDs independently. This is efficient and fits in the size of a int (4 Bytes or 32 bits).

Here is the complete code in Java (Inspired by Twitter snowflake, code credits) -

Step 1 - We initialize the number of bits that each component will require :

public class Snowflake {

    // Sign bit, Unused (always set to 0)
    private static final int UNUSED_BITS = 1; 

    private static final int EPOCH_BITS = 20;
    private static final int NODE_ID_BITS = 5;
    private static final int SEQUENCE_BITS = 6;

    private static final int maxNodeId = (int)(Math.pow(2, NODE_ID_BITS) - 1);
    private static final int maxSequence = (int)(Math.pow(2, SEQUENCE_BITS) - 1);

    // Custom Epoch (Fri, 21 May 2021 03:00:20 GMT)
    private static final int DEFAULT_CUSTOM_EPOCH = 1621566020;

    private volatile int lastTimestamp = -1;
    private volatile int sequence = 0;

    // Class Constructor
    public Snowflake() {
        this.nodeId = createNodeId();
        this.customEpoch = DEFAULT_CUSTOM_EPOCH;
    }
Enter fullscreen mode Exit fullscreen mode

Here, we are taking custom epoch as of Fri, 21 May 2021 03:00:20 GMT.

EPOCH_BITS will be 20 bits and is filled with a current timestamp in seconds (You can also use millisecond if there is a possibility of multiple numbers of requests per second).

NODE_ID_BITS will be 5 bits and is filled using the Mac address.

SEQUENCE_BITS will be 6 bits and will act as a local counter which will start from 0, goes till 63, and then resets back to 0.

Step 2 - Creating a synchronized function to generate the IDs :

        public synchronized int nextId() {
            int currentTimestamp = (int) (Instant.now().getEpochSecond() - customEpoch);

            if(currentTimestamp < lastTimestamp) {
                throw new IllegalStateException("Invalid System Clock!");
            }

            lastTimestamp = currentTimestamp;

            return currentTimestamp << (NODE_ID_BITS + SEQUENCE_BITS) | (nodeId << SEQUENCE_BITS) | sequence;
        }
Enter fullscreen mode Exit fullscreen mode

Wait, why are we doing these Left Shifts & Logical OR operations?

This is because Integer is represented by 32 bits and initially all are set to 0.
To fill these bits we have to take each component separately, so first we took the epoch timestamp and shift it to 5 + 6 i.e 11 bits to left. Doing this has filled the first 21 bits with the first component (remember the first bit is always set to zero to make the overall number positive). The remaining 11 bits are still 0 and hence again we repeat the same thing with logical OR & the other two components as well thereby filling all the 32 bits and forming the complete number.

Step 3 - Utility function to generate the node id using the systemโ€™s MAC address:

private int createNodeId() {
            int nodeId;
            try {
                StringBuilder sb = new StringBuilder();
                Enumeration<NetworkInterface> networkInterfaces = NetworkInterface.getNetworkInterfaces();
                while (networkInterfaces.hasMoreElements()) {
                    NetworkInterface networkInterface = networkInterfaces.nextElement();
                    byte[] mac = networkInterface.getHardwareAddress();
                    if (Objects.nonNull(mac))
                        for(byte macPort: mac)
                            sb.append(String.format("%02X", macPort));
                }
                nodeId = sb.toString().hashCode();
            } catch (Exception ex) {
                nodeId = (new SecureRandom().nextInt());
            }
            nodeId = nodeId & maxNodeId;
            return nodeId;
        }
    }
Enter fullscreen mode Exit fullscreen mode

How it works? ๐Ÿ’ก

Letโ€™s now understand its working with an example -

Letโ€™s say itโ€™s Sun, 23 May 2021 00:00:00 GMT right now. The epoch timestamp for this particular time is 1621728000.

First of all, we adjust our timestamp with respect to the custom epoch-

currentTimestamp = 1621728000- 1621566020 = 161980(Adjust for custom epoch)

So to start our ID, the first 20 bits of the ID (after the signed bit) will be filled with the epoch timestamp. Let's this value with a left-shift :

id = currentTimestamp << (NODE_ID_BITS + SEQUENCE_BITS )

Next, we take the configured node ID/shard ID and fill the next 10 bits with that

id = id | nodeId << SEQUENCE_BITS

Finally, we take the next value of our auto-increment sequence and fill out the remaining 6 bits -

id = id | sequence // 6149376

That gives us our final ID ๐ŸŽ‰

And thatโ€™s it! Primary keys that are unique across our application!

Summary ๐Ÿ“Š

This article showed you a simple solution of how to generate a snowflake id whose length is >=7 and <=10.

By the way, you can adjust the bit count of the 3 components to adapt to your work.

NOTE :
We should keep the generator as a singleton, it means that we should only create the single instance of SequenceGenerator per node. If not, it may generate some duplicate Ids.

Not only did twitter used it, Discord also uses snowflakes, with their epoch set to the first second of the year 2015.

Instagram uses a modified version of the format, with 41 bits for a timestamp, 13 bits for a shard ID, and 10 bits for a sequence number.

I hope this will help you! Thanks for reading :))


Want to start with web development? Checkout โ–ถ HTML To React: The Ultimate Guide

This ebook is a comprehensive guide that teaches you everything you need to know to be a web developer through a ton of easy-to-understand examples and proven roadmaps

It contains ๐Ÿ‘‡

โœ… Straight to the point explanations

โœ… Simple code examples

โœ… 50+ Interesting project ideas

โœ… 3 Checklists of secret resources

โœ… A Bonus Interview prep

You can even check out a free sample from this book

and here's the link to the complete book, if you want to buy it for yourself

Original price is $ 40 but price with this link - $16(which is 60% less than the original price)๐Ÿ˜‰

๐Ÿ‘‡

eBook.png

Discussion (1)

Collapse
golfman484 profile image
Driving change • Edited

Great article! I used a similar time based + sequence ID generator to create "unique" class IDs in a visual class editor many years back.

Just one point I would like to make: It would seem that if you ran this on many nodes and requested lots of IDs that it is likely that duplicate IDs will be generated because of the way the node ID is derived from the MAC in the createNodeId() method.

While every MAC should be unique, by taking only the 5 least significant bits of the MAC means that you're only using 5 bits of the last byte of the MAC. The probability that two MAC addresses will have the same last byte is 1 in 256. The chance that they will have the same 5 least significant bits is 1 in 160 (5/8 * 256). If only winning the lottery had odds as good as that!

Let's say you decided to have 32 nodes then the chances of any two nodes ending up with the same node ID becomes 1 in 5!!! (160/32) - now that's the lottery I want to buy a ticket in! :)

In this case it may be safer to manually (or otherwise) allocate a specific, unique node ID to each node via some configuration mechanism to ensure that no two nodes end up with the same node ID - and then the chance of two different nodes generating the same ID goes to zero in a million bazillion :)

I picked 32 nodes in the example above because that is the maximum possible nodes you can have with a 5 bit node ID (2 ^ 32) - if each node is allocated its own unique dedicated ID then all 32 nodes will generate unique IDs which may not happen if auto allocating node IDs based on the MAC via createNodeId().