When building any system, distributed or not, you end up dealing with many identifiers for your data. From database rows all the way to identifiers for the version of your system in production. Unsurprisingly Encore's own systems are no different. We need to uniquely identify and track things from individual API endpoints in your applications, traces from runtime calls, to the various
infrastructure resources provisioned in one of our supported clouds, such as an AWS security group.
Deciding how to generate your identifiers can sometimes be very simple. For instance, you might just put an auto-incrementing number as your primary key in your database. Boom, you have your type of identifier and way to generate it, just like magic.
However, it's much harder in a distributed system to just have a number start at 1 and slowly increase. You could build a system that elects a leader, and that leader is in charge of incrementing the number - but this adds a lot of complexity to your system design, it doesn't scale indefinitely as you will still be limited by the throughput of the leader. You could still suffer from a split-brain issue where the same number is generated twice by two different "leaders".
Getting this decision right early on in your project is always important, as it's one of the hardest things to change in a system once you're in production.
Starting from our requirements
When we started to think about how we wanted to represent identifiers in Encore's platform, we wrote down what we wanted as core requirements from an identifier:
- Sortable
We want to be able to order the identifiers, such that an A
< B
when A
was created first. However, we don't need total sorting, being k-sortable is acceptable.
Being sortable allows us better indexing performance in the database, will enable us to easily iterate through records in order and improves our ability to debug as the order of events can be determined from the event identifiers.
- Scalable
We want a system that will scale with us without bottlenecks. We'll be using this system to generate identifiers for traces and spans in which we'll be creating a vast number of them.
- No Collision Risk
Because we run a distributed system, we don't want the risk of two of our processes creating the same identifier.
- Zero Configuration
We don't want to have to do any level of configuration when using this system on either a per-machine or per-process level.
- Type Safety
We want a system where an identifier for one resource can't accidentally be passed or returned as an identifier for another type of resource.
As a bonus, we also wanted identifiers to be:
- Reasonably Small
Both in memory and on the wire.
- Human-readable
This is related to type safety. However, we want the string representation to allow a human (us 🤖) to understand what the identifier was created for.
This seems like quite a list of asks, but it's pretty manageable when we break it down.
Existing Options
If we temporarily ignore the type safety and human readability requirements, then what we're asking for has been solved a thousand times before, and we don't need to re-invent the wheel here. Let's look at some existing battled-tested options.
Database powered auto-incrementing keys ✘
The first port of call for most people is an auto-incrementing primary key. This solves being sortable and has no collision risk. However, it can only scale as far as your database server will process writes. It also adds a new requirement that every identifier you generate must have a matching database row. Given we wanted to use this system for things we don't store in our database, we ruled this out pretty quickly.
UUIDs ✘
There are various versions of UUIDs, and they are all 128 bits. At first glance, versions 1, 2, 4 seem perfect from a scalability, zero configuration and collision risk angle. However, when you dig deeper, the different versions of UUIDs start to show warts:
- Versions 1 & 2
Even though they can generate approximately 160 billion identifiers per second, they use the machine's MAC address which means two processes on the same machine could generate the same identifier if called simultaneously.
- Version 4
Generates completely random identifiers with 2122 bits of randomness. This means we're very unlikely to have a collision and can infinitely scale-out without a bottleneck. However, we lose the ability to sort the identifiers chronologically.
Snowflake ✘
Snowflake identifiers were first developed by Twitter and typically are
64 bits (although some variants use 128 bits). This scheme encodes the time
the ID was generated as the first 41 bits, then encodes an instance
ID for the next 10 bits, and finally a sequence
number for the final 12 bits.
This gives us pretty good scalability as the instance
set of bits allows us to run 1024
different processes simultaneously while knowing they can not generate the same identifier, as the process's own identifier is encoded within! It gives us our k-sorting, as the upper bits are the time the ID was generated. Finally, we have a sequence
number within the process, which allows us to generate 4096
identifiers per second per process.
However, Snowflake misses our zero-configuration requirement because the instance
ID must be configured per process.
KSUID ✘
KSUID's are sort of a cross between UUID Version 4 and Snowflake. They are 160 bits where the first 32 bits are the time
the identifier was generated (to the second) and then 128 bits of random
data. This makes them almost ideal for us. They are k-sortable, require no configuration, and have no collision risk because of the large amount of entropy in the random part of the id.
However, we discovered something interesting during our research of KSUID. The string encoding of KSUID uses Base-62 encoding, and so has both uppercase and lowercase letters. This means depending on your string sorting, you might sort the identifiers differently - i.e. we lose our requirement for sortability depending on the system. For instance, Postgres sorts lowercase before uppercase, whereas most algorithms sort uppercase before lowercase, which could lead to some very nasty & hard-to-identity bugs. (It's worth noting that this impacts any encoding scheme which uses both upper and lower case letters, so it isn't just limited to KSUID)
XID ✓
XID's are 96 bits. The first 32 bits are the time
, which means we get our k-sorting immediately. The next 40 bits are a machine identifier
and a process identifier
. However, unlike the other systems, these are calculated automatically using the library and don't require us to configure anything ourselves. The final 24 bits are a sequence
number, which allows a single process to generate 16,777,216 identifiers per second!
XID gives us all our core requirements, and its string encoding uses base 32 (no upper case letters to break our sorting!). This string encoding is always 20 characters, which means we can use that fact for validation purposes in any marshalling code (such as a Postgres CHECK
constraint
on the database type).
Our Identifiers
Once we had settled on using XID as the basis for our ID types, we switched focus back to our last two requirements: type safety and human readability.
For the former requirement, we could have solved this as simply as:
import "github.com/rs/xid"
type AppID xid.ID
type TraceID xid.ID
func NewAppID() AppID { return AppID(xid.New()) }
func NewTraceID() TraceID { return TraceID(xid.New()) }
And thanks to Go's type system, both AppID
and TraceID
would concretely be different types, such that the following
would become a compile error:
var app AppID = NewTraceID() // this won't compile
This would have worked. However, we would have to implement all the marshalling functions (encoding.TextMarshaler
,
json.Marshaler
, sql.Scanner
, etc.) for each concrete type like AppID
. To minimise boilerplate writing for our team, this would mean using a code generator to write it for us.
Note: We could have written the marshalling functions once using type aliases (type AppID = xid.ID
), but the compiler would have treated the ID types as interchangeable.
Another downside here is our system isn't just Go. We also have a Typescript frontend and a Postgres database. This means once we encode the ID into a wire format, we've lost all guarantees of type safety, and now in another system, it's possible to mistakenly use one form of ID for another.
Before Go 1.18, we could have solved this by adding a wrapper struct containing the type information:
import (
"fmt"
"github.com/rs/xid"
)
type EncoreID struct {
ResourceType string
ID xid.ID
}
type AppID *EncoreID
type TraceID *EncoreID
func NewAppID() AppID { return &EncoreID{ "app", xid.New() } }
func NewTraceID() TraceID { return &EncoreID{ "trace", xid.New() } }
Now we can update our marshalling functions to prefix the EncoreID.ResourceType
at the beginning of the string. One
slight downside here is that aside from passing around a lovely 20 byte array (which is how xid
is encoded in memory),
we would also be passing around a struct with string instances. Not super inefficient, but not as nice!
Enter Go Generics
With Go 1.18, we can create a single abstract ID type and then create different concrete types based on a ResourceType
,
but really are just xid
's. So conceptually, we started with this:
import (
"fmt"
"github.com/rs/xid"
)
type ResourceType struct{}
type App ResourceType
type Trace ResourceType
type ID[T ResourceType] xid.ID
func New[T ResourceType]() ID[T] { return ID[T](xid.New()) }
This gave us an excellent starting point, as the memory format of these identifiers had not changed from XID's underlying [12]byte
.
However, the compiler will treat ID[App]
and ID[Trace]
as two distinct types. We also don't need to use code generation for all the marshalling functions as we can write them once for the generic type.
The one requirement we've not solved with this generic version is our type safety for both the wire formats and the human
readability. (i.e. it's still possible for us to json.Marshal
an application ID and json.Unmarshal
that into a trace ID). To solve this, we can utilise the ability for generics in Go to create the default instance of a type. Then we can call a method on it. For example, our string method looks a little like this:
func (id ID[T]) String() string {
var resourceType T // create the default value for the resource type
return fmt.Sprintf(
"%s_%s",
resourceType.Prefix(), // Extract the "prefix" we want from the resource type
xid.ID(id).String(), // Use XID's string marshalling
)
}
However, we've skipped a tiny bit: how do we make our App
and Trace
resource types have a Prefix method we can call?
Well, we need to change ResourceType
into an interface:
type ResourceType interface { Prefix() string }
type App struct{}
func (u App) Prefix() { return "app" }
type Trace struct{}
func (a Trace) Prefix() { return "trace" }
All of our marshalling and unmarshalling code can now include and verify the identifiers type prefix, which means we
can easily validate if due to a typo, we pass trace_0ncid27rirfkbch0hld0
as an argument to an API expecting an application ID.
PS. Postgres is here
As I alluded to earlier, these strongly typed wire formats mean we can create matching types in other systems without losing any of the type safety guarantees we wanted. For instance, we use code generation to automatically create new
Postgres domain types for each of our ResourceTypes
in our database so we can strongly type the database columns.
CREATE DOMAIN application_id AS TEXT CHECK (VALUE ~ '^app_[a-z0-9]{20}$');
We then use SQLC to generate type safe Go code to read and write to our database, taking our custom ID types and passing them through all the layers to the database without losing our type safety. We'll talk about how we use SQLC and other code generation in a future blog post to give us type safety into and out of the database. I'll leave you with a little sneak peek of a generated helper:
func GetMembersForApp(q db.Querier, id eid.ID[eid.App]) ([]*db.AppMembers, error)
Final words
While generics have been a long time coming in Go, there has been excellent code generation tooling to fill that gap. The two are not mutually exclusive and can be used simultaneously to reduce the cognitive load and the boilerplate code you have to write. It's easy to overuse and abuse generics, but if you take your time and use them carefully, you'll be thinking with Generics!
You can use all the Go 1.18 features, including Generics, with Encore. We'd love it if you'd try out Encore and join Slack to give us feedback!
This article was originally published on the Encore Blog on 25-03-2022.
Top comments (0)