
Nils Diekmann

Posted on • Originally published at Medium

e1889d6d-945d-436b-b94f-68f268285e7c — WTF? I’m only human

Designing human-friendly identifiers is easy - follow these simple guidelines.

If you are only interested in a quick solution, check out my C# and Go examples on GitHub. Read on for the motivation behind the solution.

KinNeko-De / sample-humanfriendly-id-csharp

Sample how to create a human-friendly identifier in C#

Generate a human friendly id

dotnet run --project Sample.HumanFriendlyId/Sample.HumanFriendlyId.csproj

Motivation

The motivation behind this project is described in an article




KinNeko-De / sample-humanfriendly-id-go

Sample how to create a human-friendly identifier in Go

Generate a human friendly id

go run cmd/humanfriendly-id/main.go

Motivation

The motivation behind this project is described in an article





Accept the differences - Humans are not systems and vice versa

The debate is as old as mankind. Some argue that values in systems must be human-readable because the user is human. Others want to force the user to deal with technical values because they are already used by the system.

Have you ever typed a UUID from a letter? If so, I have great respect and at the same time pity for you. UUIDs are designed for systems, not for people. On the other hand, business identifiers, like invoice numbers, are not designed for reliable relationships in systems, especially not in modern distributed microservices.

Image generated by https://deepai.org/machine-learning-model/text2img

In computer science, we have the principle of 'separation of concerns'. It says that you should divide your program into distinct sections, each dealing with one concern. If you follow this principle, you gain the freedom to design each section for its own purpose.

It also saves us unnecessary headaches when designing the solution. We can still use UUIDs for internal system IDs. In the rare cases where we need to communicate with a human, we use a specially designed human-friendly identifier. I will now give a guideline on how to design it.


Requirements - select what needs to be met

Requirements always depend on the context. I will provide a list of requirements based on my experience. If some of them are not necessary for you, feel free to leave them out. If you have additional requirements, please leave a comment. I would appreciate it if you explained why you have these requirements so that I can learn from you.

  • (R) It must be easy to read by humans.
  • (P) Errors of perception must be avoided.
  • (W) It must be easy to spell, hand-write, and type by humans.
  • (O) It must be easy to read by optical character recognition.
  • (S) It must be hard to guess.
  • (U) It must be unique for the related entity.

(R) This ensures that users can quickly and accurately interpret the identifier without confusion or difficulty. By prioritizing readability, we aim to improve the user experience and streamline the use of the identifier in different contexts.

(P) It is important to design the identifier in a way that minimizes misinterpretation. By prioritizing clarity and distinctiveness, we aim to reduce the risk of misreading the identifier.

(W) This ensures ease of use across multiple communication channels and input methods. We aim to make communication and data entry seamless for users.

(O) This ensures that the identifier can be accurately scanned and processed by machines, increasing efficiency and automation in various applications. By designing the identifier with OCR readability in mind, we ensure compatibility with modern technology and data processing workflows.

(S) It is important to make the identifier sufficiently complex and unpredictable to prevent unauthorized access or manipulation. Our goal is to enhance security and protect the integrity of the identifier and associated data.

(U) By enforcing uniqueness, we maintain data integrity and facilitate accurate referencing and retrieval of information. Each entity is uniquely identified without any overlap or duplication.

From time to time in this article, I will refer to these requirements with the letter at the beginning. This means that I have taken this particular requirement into account when designing a human-friendly identifier.


Characters - make the identifier easy to read

To generate the identifier, we need a specific set of characters. We have to choose this collection carefully because it affects readability (R), perception (P), and writing (W). Ensuring the suitability of these characters is essential to their effectiveness in interacting with humans.

In addition to the characters, the choice of font we use is important in terms of readability (R). Have a look at the appearance of the following characters in print. As you can see in the image below, it is hard to tell the difference even when you see them side by side. Imagine how difficult it would be if you could only see one character. We need to choose a font that makes it easier to distinguish between the characters.

Uppercase "I", lowercase "l", and the digit "1" in the fonts Calibri, Arial, Times New Roman, and Courier New

Using both lowercase and uppercase letters also raises the question of whether the reader has to respect the case when writing or typing the identifier. With a mixed-case character set, a program cannot easily normalize the input to avoid errors. If the character set contains only uppercase letters, we can simply convert any input to uppercase. This choice also removes the lowercase "l" problem.

We need as many characters as possible to ensure uniqueness (U) and to keep the identifier short (R). By adding digits, we increase the number of possible characters. On the other hand, this creates a problem with the uppercase "O" and the zero, as they look very similar in most fonts.

Uppercase "O" and the digit "0" in the fonts Calibri, Arial, Times New Roman, and Courier New

We could remove one of them from the set. But the reader of the identifier does not know that we omitted one of the two characters, let alone which one. I have seen several times that a note is added to explain whether both zero and "O" or only one of them is used. If we remove both characters, we avoid any kind of confusion for the human (R).

If you live in a country like Germany, where they even print out electronic receipts and still use real paper letters (to the rest of the world: this article was written in 2024!), it is also important to consider that someone will reply in handwriting (W).

To OCR (O), a handwritten "1" may look like an "I". Removing just one of them, preferably the "I", is useful in this case because we can normalize the input by replacing a recognized "I" with a "1". The digit "1" is less likely to be confused with other characters. By excluding the "I", we also make it easier to recognize the characters, as there is no conflict with the lowercase "l" (R).

This leaves us with the following character set of 32 characters (note that it also omits the digit "8"):

ABCDEFGHJKLMNPQRSTUVWXYZ12345679
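
As a minimal sketch, the character set can be kept as a constant and used to check whether a string is a well-formed identifier. The names `Charset` and `IsValid` are assumptions made for this illustration and are not taken from the linked repositories.

```go
package main

import (
	"fmt"
	"strings"
)

// Charset is the character set from the article: the uppercase letters
// without "I" and "O", plus the digits 1 to 7 and 9.
const Charset = "ABCDEFGHJKLMNPQRSTUVWXYZ12345679"

// IsValid reports whether id is non-empty and consists only of
// characters from Charset. (Hypothetical helper for this sketch.)
func IsValid(id string) bool {
	if id == "" {
		return false
	}
	for _, r := range id {
		if !strings.ContainsRune(Charset, r) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(len(Charset))        // 32
	fmt.Println(IsValid("AKL1U9ZW")) // true
	fmt.Println(IsValid("AKL0U9ZW")) // false: contains a zero
}
```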


Length - how much do you really need?

In general, it is a good idea to use a large set of characters, as it determines the number of unique identifiers you can generate (U). When you use a smaller character set for the reasons above, you have to increase the length of your identifier to allow more combinations.

The number of possible combinations is calculated as follows:

combinations = n^k, where n is the number of characters in the set and k is the length of the identifier

We have to keep the identifier reasonably short, because long identifiers are difficult to remember and read (R). If you ever run out of combinations because your business scales like hell, just increase the length for new identifiers and keep the old ones as they are. That is better than preparing for a success that may never come because your identifiers are too long to use.

With a length of only 8, there are 1,099,511,627,776 combinations, which is enough most of the time. That is roughly 140 times the number of people in the world. Look at this example, close your eyes, and try to remember it:

AKL1U9ZW

On the other hand, an identifier sometimes needs to be hard to guess (S) because it is used in a security context. If the identifier is too short, a brute-force attack is more likely to find a valid one. It also becomes more likely that a typo leads to another valid identifier (U)(W).

Using the character set from the example above and a length of 16, we get the astronomically large number of 1,208,925,819,614,629,174,706,176 combinations. Try to remember the following example as you did before. Not so easy anymore, right?

AKL1U9ZWKWA6QPAB
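
The two counts quoted in this section can be reproduced with a few lines of Go. This is only a sketch: the function name `Combinations` is an assumption, and `math/big` is used because 32 to the power of 16 does not fit into a 64-bit integer.

```go
package main

import (
	"fmt"
	"math/big"
)

// Combinations returns n^k: the number of different identifiers of
// length k that can be built from a character set of n characters.
func Combinations(n, k int64) *big.Int {
	return new(big.Int).Exp(big.NewInt(n), big.NewInt(k), nil)
}

func main() {
	fmt.Println(Combinations(32, 8))  // 1099511627776
	fmt.Println(Combinations(32, 16)) // 1208925819614629174706176
}
```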


Display - the best way to recognize it

The identifier itself is a random, uninterrupted combination of characters. It has no obvious meaning or context. The human brain tends to look for familiar patterns. Since there are none here, it is very difficult to recognize each of the characters.

Remember the tally marks we used when we were kids? After drawing four vertical lines, the fifth line is drawn diagonally across the previous four lines. This groups five of the strokes together to make counting easier.

Both times twenty elements

Grouping items helps people to understand and organize information more quickly and efficiently. This is also known as the "law of proximity". We cannot strike through our characters, but we can put a separator between them. We could use hyphens, dashes, or slashes; UUIDs, for example, use hyphens. But when you type or spell the identifier, don't you always wonder if you can leave out the separators?

As children, we would never write the tally marks with a minus or even a slash between them. Such characters were not in our minds at that time. Children always use the simplest and most obvious solution. After five strokes, we leave a little space to group them even more. In the following example the second line is easier to read, right?

Grouping by distance

We can do the same simple trick with our identifier and use spaces as separators (P). Humans hardly perceive these separators as characters at all. When we process the input, we can normalize it by removing all the spaces between the characters. This makes it as easy as possible for humans to type the identifier. And when you spell the identifier, you will hopefully not even notice that there is a hidden separator.

After how many characters should we insert a separator? According to "Miller's Law", people can hold about seven elements, plus or minus two, in short-term memory. The exact number depends on the person, so we use the lower limit to include everyone (P). Other publications put this limit as low as four elements, which matches the four vertical strokes of the tally marks. See the result:

AKL1 U9ZW HKAL BCND

As you may have noticed, we are not the only ones using this rule. IBANs and credit card numbers are grouped the same way. I am pretty sure there are more real-life examples in your cultural context to convince you that this idea is widely adopted.
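
The grouping for display could look like the following sketch. The function name `Format` and its `groupSize` parameter are assumptions for this illustration; the code relies on the identifier containing only single-byte characters, which holds for the character set used in this article.

```go
package main

import (
	"fmt"
	"strings"
)

// Format inserts a space after every groupSize characters so that the
// identifier is easier to read. It assumes id contains only ASCII.
func Format(id string, groupSize int) string {
	if groupSize <= 0 {
		return id
	}
	var b strings.Builder
	for i := 0; i < len(id); i++ {
		if i > 0 && i%groupSize == 0 {
			b.WriteByte(' ')
		}
		b.WriteByte(id[i])
	}
	return b.String()
}

func main() {
	fmt.Println(Format("AKL1U9ZWHKALBCND", 4)) // AKL1 U9ZW HKAL BCND
}
```

The stored value stays free of separators; the spaces are added only when the identifier is shown to a human.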


Random - at least pseudo and safe

The identifier should be randomly generated using a cryptographically secure algorithm (S). This makes it easier for you to pass a pen test, especially in a security-critical scenario. Proactively demonstrating that you have taken this into account during design increases confidence in your overall solution.

Even in non-security scenarios, you want to avoid using auto-incrementing numbers. The recipient of an invoice with invoice number "5" does not need to know that you have only written five invoices this year. Hiding this information is also in the best interest of the business.

Note that, unlike a UUID, the identifier is not globally unique. The generation algorithm is pseudorandom and the identifier is shorter, so there are far fewer combinations and collisions can happen. To make it unique for a given entity, you must enforce uniqueness through state that is shared between concurrent processes, such as a unique constraint in a database (U).
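
A minimal sketch of the generation in Go, assuming the 32-character set from above. It uses `crypto/rand` from the standard library; the names `Generate`, `GenerateUnique`, and the `reserve` callback are hypothetical. `reserve` stands in for whatever shared state you use, for example an insert into a table with a unique constraint, and reports `false` on a collision.

```go
package main

import (
	"crypto/rand"
	"errors"
	"fmt"
)

const charset = "ABCDEFGHJKLMNPQRSTUVWXYZ12345679"

// Generate returns a random identifier of the given length, drawn from
// a cryptographically secure source.
func Generate(length int) (string, error) {
	if length <= 0 {
		return "", errors.New("length must be positive")
	}
	buf := make([]byte, length)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	for i, b := range buf {
		// 256 is a multiple of 32, so the modulo introduces no bias.
		buf[i] = charset[int(b)%len(charset)]
	}
	return string(buf), nil
}

// GenerateUnique retries until reserve accepts an identifier or
// maxAttempts is reached. reserve is a hypothetical stand-in for an
// insert that is guarded by a unique constraint.
func GenerateUnique(length, maxAttempts int, reserve func(id string) (bool, error)) (string, error) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		id, err := Generate(length)
		if err != nil {
			return "", err
		}
		ok, err := reserve(id)
		if err != nil {
			return "", err
		}
		if ok {
			return id, nil
		}
	}
	return "", errors.New("no unique identifier found")
}

func main() {
	// In-memory stand-in for the shared state; not safe for concurrent use.
	taken := map[string]bool{}
	reserve := func(id string) (bool, error) {
		if taken[id] {
			return false, nil
		}
		taken[id] = true
		return true, nil
	}
	id, err := GenerateUnique(8, 5, reserve)
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

If you change the character set to a size that does not divide 256, the modulo step would favor some characters and should be replaced by rejection sampling.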


Summary

  • Use the character set ABCDEFGHJKLMNPQRSTUVWXYZ12345679
  • Randomly generate an identifier with a length of 8 or 16
  • To display it to humans, insert a space after every four characters
  • On human input, convert to uppercase, remove spaces, and replace "I" with "1"
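
The input handling from the last bullet can be sketched in Go as follows. The function name `Normalize` is an assumption; besides removing spaces and replacing "I" with "1", the sketch converts the input to uppercase, as suggested in the section on characters.

```go
package main

import (
	"fmt"
	"strings"
)

// Normalize turns human input back into the stored form: it converts
// the input to uppercase, removes spaces, and replaces "I" with "1".
func Normalize(input string) string {
	upper := strings.ToUpper(input)
	return strings.NewReplacer(" ", "", "I", "1").Replace(upper)
}

func main() {
	fmt.Println(Normalize("akl1 u9zw")) // AKL1U9ZW
	fmt.Println(Normalize("AKLI U9ZW")) // AKL1U9ZW
}
```

After normalization, the value can be checked against the character set and looked up as usual.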

How do you know all that shit? Are you a real programmer?

Thank you to the University of Paderborn for teaching me that software engineering is more than just coding. Now it is my time to give something back.

Code examples

Last but not least, check out the code examples in C# and Go on my GitHub page. I would be happy if you could give them a star. Thanks for reading my article ❤
