How UUIDs Actually Work: v4 Randomness, v7 Timestamps, and the Collision Math

#database #backend #computerscience

Every backend developer types uuid() a thousand times before ever asking what those 36 characters actually are. Then one day a senior engineer says "stop using v4 for your primary key, it's wrecking the index," and suddenly the thing you treated as a magic random string has versions, trade-offs, and a surprising amount of math underneath. Here is the whole picture.

A UUID is 128 bits wearing a costume

A UUID is just a 128-bit number. The familiar form — f47ac10b-58cc-4372-a567-0e02b2c3d479 — is those 128 bits written as 32 hexadecimal digits, split into groups of 8-4-4-4-12 with dashes for readability. The dashes carry no information; they're punctuation.
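To make the "128 bits wearing a costume" idea concrete, here is a minimal sketch using Python's standard uuid module. It only re-reads the example value from the paragraph above in three equivalent forms; the variable name is arbitrary.

```python
import uuid

u = uuid.UUID("f47ac10b-58cc-4372-a567-0e02b2c3d479")

# The same 128 bits, viewed three ways.
print(u.int)         # one big integer
print(u.hex)         # 32 hex digits, no dashes
print(len(u.bytes))  # 16 bytes = 128 bits

# The dashes are punctuation: parsing without them yields the same value.
assert uuid.UUID("f47ac10b58cc4372a5670e02b2c3d479") == u
```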

But not all 128 bits are free. Two small fields are reserved:

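4 version bits: which UUID type this is (the 4 that opens the third group, ...-4372-..., marks a version 4).
2 variant bits: which layout family the UUID follows (almost always the variant defined by RFC 9562, the 2024 standard that replaced the old RFC 4122). They are the top two bits of the fourth group, which is why its first hex digit is always 8, 9, a, or b (the a in ...-a567-...).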

So a "random" UUID isn't 128 random bits. It's 122 random bits plus 6 bits of bookkeeping. That number matters later, so hold onto it.
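The bookkeeping bits can be pulled out with two shifts. This is a small sketch, again with Python's standard uuid module; the shift amounts follow from the field positions (version in the 13th hex digit, variant at the top of the 17th), and the variable names are only illustrative.

```python
import uuid

u = uuid.UUID("f47ac10b-58cc-4372-a567-0e02b2c3d479")

# Version: the 4 bits that start the third group (bits 76..79 counting from the right).
version = (u.int >> 76) & 0xF

# Variant: the top 2 bits of the fourth group (bits 62..63 counting from the right).
variant = (u.int >> 62) & 0b11

print(version)        # version number stored in the UUID
print(bin(variant))   # 0b10 is the RFC 4122 / RFC 9562 layout

# Everything else is payload: 128 - 4 - 2 bits.
random_bits = 128 - 4 - 2
print(random_bits)
```

The standard library exposes the same information as u.version and u.variant, so the manual shifts are only there to show where the bits live.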

The versions you'll actually meet

There are several versions, but three show up in real systems:

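Version 1: timestamp + MAC address. A 60-bit timestamp (100-nanosecond ticks since 15 October 1582, the start of the Gregorian calendar) combined with a clock sequence and the machine's network MAC address. The creation time is recoverable, but the timestamp's low-order bits are stored first, so v1 values do not sort by creation time as strings or bytes (version 6 exists to reorder those fields). Worse, it leaks where and when it was made: the MAC address is literally embedded. That privacy footgun is why v1 fell out of fashion.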
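To see how much a v1 value gives away, the sketch below decodes one. It relies on Python's uuid module, whose time property reassembles the 60-bit timestamp; the helper name v1_created_at is made up for this example. Note that Python may substitute a random node value when it cannot read a MAC address.

```python
import datetime
import uuid

# 100-ns ticks between 1582-10-15 (Gregorian epoch) and 1970-01-01 (Unix epoch).
GREGORIAN_TO_UNIX_TICKS = 0x01B21DD213814000

def v1_created_at(u):
    """Recover the creation time embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("not a version 1 UUID")
    unix_ticks = u.time - GREGORIAN_TO_UNIX_TICKS  # still in 100-ns units
    return datetime.datetime.fromtimestamp(
        unix_ticks / 10_000_000, tz=datetime.timezone.utc
    )

u = uuid.uuid1()
print(v1_created_at(u))   # when it was made
print(f"{u.node:012x}")   # where: the 48-bit node field (usually the MAC)
```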

Version 4: (almost) all random. 122 random bits, no timestamp, no machine identity. It became the default because it's dead simple: no shared state, no coordination, any process can mint one offline and trust it's unique for all practical purposes. This is what most uuid libraries hand you by default.

Version 7: timestamp + randomness, done right. Standardized in RFC 9562 (2024). The first 48 bits are a Unix millisecond timestamp; the remaining 74 bits (after the version and variant fields) are random, although the RFC also lets implementations spend some of them on a counter or on sub-millisecond precision. You get v4's "generate anywhere, no coordination" property plus genuine time-ordering, because the timestamp sits in the most significant bits, and no MAC address is leaked. v7 is the one the industry is quietly migrating to, for reasons that become obvious once you look at databases.
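The v7 layout is simple enough to assemble by hand. The following is a minimal sketch, not a reference implementation: the function name uuid7 and its unix_ms parameter are assumptions for illustration, it fills every non-timestamp bit with randomness (no counter), and it uses only the standard library so it runs on Python versions that predate a built-in v7 generator.

```python
import os
import time
import uuid

def uuid7(unix_ms=None):
    """Build a version 7 UUID: 48-bit ms timestamp, then 74 random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 are used

    value = (unix_ms & 0xFFFFFFFFFFFF) << 80      # 48-bit timestamp, high bits
    value |= 0x7 << 76                            # 4 version bits
    value |= ((rand >> 62) & 0xFFF) << 64         # 12 random bits
    value |= 0b10 << 62                           # 2 variant bits
    value |= rand & ((1 << 62) - 1)               # 62 random bits
    return uuid.UUID(int=value)

u = uuid7()
print(u)
print(u.int >> 80)  # the embedded Unix timestamp in milliseconds
```

In production code a maintained library (or the platform's own generator, where available) is the safer choice, since it can also handle monotonicity within a millisecond.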

The collision question everyone asks about v4

"If v4 is random, won't two eventually be the same?" Yes — in the same sense that you might win the lottery twice on the same day. The interesting part is the actual scale.

With 122 random bits, there are 2^122 ≈ 5.3 × 10^36 possible v4 UUIDs. Naively you'd think you're safe until you've generated half of those. But collisions follow the birthday paradox: the chance two values match grows with the square of how many you generate, not linearly. The rule of thumb is that you reach a ~50% chance of any collision after roughly the square root of the space:

50% collision  ≈  1.18 × √(2^122)  ≈  2.7 × 10^18 UUIDs

That's 2.7 quintillion. To actually hit a coin-flip's worth of collision risk, you'd have to generate a billion UUIDs per second for about 86 years straight. For any normal application, v4 collisions are a non-event, provided the generator draws from a properly seeded random source: you'll lose data to disk failure, bad migrations, and off-by-one bugs long before randomness betrays you.
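The arithmetic above is easy to reproduce. In this sketch the 1.18 factor is written out as the square root of 2 ln 2, which is where the rule of thumb comes from; the variable names and the billion-per-second rate are just the figures used in the text.

```python
import math

SPACE = 2 ** 122                               # possible v4 values
n_half = math.sqrt(2 * math.log(2) * SPACE)    # about 1.18 * sqrt(SPACE)

SECONDS_PER_YEAR = 365.25 * 24 * 3600
RATE_PER_SECOND = 1e9                          # a billion UUIDs per second
years = n_half / RATE_PER_SECOND / SECONDS_PER_YEAR

print(f"space size:        {SPACE:.2e}")
print(f"50% collision at:  {n_half:.2e} UUIDs")
print(f"years at 1e9/s:    {years:.1f}")
```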

If you want to feel the birthday-paradox effect for your own numbers (say, "what's the collision risk at 10 billion rows?"), the small-probability approximation is p ≈ n² / (2 × 2^122), which for 10 billion rows comes to roughly 9 × 10^-18. It's the same combinatorics that powers lottery odds and hash-collision estimates; you can plug values into a permutation and combination calculator and watch how fast the probability climbs with the square of the count. It's a good intuition pump for why hash sizes and ID widths are chosen the way they are.
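For trying your own numbers, a few lines of Python are enough. This sketch uses the standard birthday approximation 1 - exp(-n(n-1)/2N); the function name collision_probability and its bits parameter are illustrative choices, not anything from a library.

```python
import math

def collision_probability(n, bits=122):
    """Approximate chance of at least one collision among n random values."""
    space = 2 ** bits
    # 1 - exp(-n(n-1) / 2N); expm1 keeps precision when the result is tiny.
    return -math.expm1(-n * (n - 1) / (2 * space))

for n in (10**6, 10**10, 10**15, 2_715_000_000_000_000_000):
    print(f"{n:.1e} ids -> p = {collision_probability(n):.3g}")
```

Passing a smaller bits value (say 64) shows how quickly a narrower ID runs into trouble at the same row counts.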

Why v4 quietly hurts your database

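Here's the part that bites teams in production. If you make a v4 UUID your primary key on a database that clusters rows by primary key (InnoDB in MySQL, clustered indexes in SQL Server), every insert lands at a random position in the index. Engines that store rows in a heap, such as PostgreSQL, avoid the table-level reshuffling, but the primary key's B-tree index still takes inserts at random positions.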

Random insert positions mean:

Page splits everywhere. New rows don't append to the end; they wedge into the middle of already-full index pages, forcing the engine to split them.
Cache thrashing. Recently inserted rows are scattered across the whole index, so the hot pages your buffer pool wants to keep in memory keep changing.
Index bloat and fragmentation. Over millions of rows, the index grows larger and slower than a sequential key would.

An auto-increment integer never has this problem because every insert appends to the end. The tragedy of v4 is that you adopt it for the distributed, coordination-free generation — then pay for it in write amplification on a single database.
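The difference in insert position can be simulated without a database. The sketch below is a toy model under loud assumptions: a sorted Python list stands in for the index, and append_fraction (a made-up helper) simply counts how many inserts land at the very end. It says nothing about real page sizes or buffer pools, only about where new keys fall.

```python
import bisect
import uuid

def append_fraction(keys):
    """Fraction of inserts that land at the end of a sorted 'index'."""
    index = []
    appended = 0
    for key in keys:
        pos = bisect.bisect_right(index, key)
        if pos == len(index):
            appended += 1          # behaves like an append
        index.insert(pos, key)     # otherwise wedges into the middle
    return appended / len(keys)

n = 2000
sequential_keys = list(range(n))                      # auto-increment style
random_keys = [uuid.uuid4().bytes for _ in range(n)]  # v4 primary keys

print(append_fraction(sequential_keys))  # every insert is an append
print(append_fraction(random_keys))      # almost none are
```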

How v7 fixes it without giving up distribution

Version 7 puts a millisecond timestamp in the high bits. Because index ordering compares the most significant bytes first, UUIDs generated close in time sort close together. Inserts become mostly sequential (new rows land near the end of the index, much like an auto-increment key), while the trailing random bits still make collisions across processes vanishingly unlikely with no shared counter. Ordering among IDs minted within the same millisecond is not guaranteed unless the generator adds a counter.

So v7 gives you:

Append-friendly inserts (no random page splits)
Time-sortable IDs for free (great for "newest first" queries and pagination)
No central coordinator, no MAC leak
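The time-sortable property can be checked directly. This sketch builds v7-style values by hand with a hypothetical helper, uuid7_at, that takes the timestamp as an argument so each ID gets a distinct millisecond; real generators read the clock themselves, and IDs sharing a millisecond may come out in either order.

```python
import os
import uuid

def uuid7_at(unix_ms):
    """Hand-rolled v7 layout: 48-bit ms timestamp followed by random bits."""
    rand = int.from_bytes(os.urandom(10), "big")
    value = (unix_ms & 0xFFFFFFFFFFFF) << 80   # timestamp in the high bits
    value |= 0x7 << 76                         # version
    value |= ((rand >> 62) & 0xFFF) << 64      # 12 random bits
    value |= 0b10 << 62                        # variant
    value |= rand & ((1 << 62) - 1)            # 62 random bits
    return uuid.UUID(int=value)

start_ms = 1_700_000_000_000
ids = [uuid7_at(start_ms + i) for i in range(1000)]

# Generation order and sort order agree, as values and as strings.
print(sorted(ids) == ids)
print(sorted(map(str, ids)) == list(map(str, ids)))

# The creation time is recoverable from the top 48 bits.
print(ids[-1].int >> 80)
```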

That's why "use v7 for primary keys, v4 only when you specifically want unpredictability" is becoming the standard advice. (ULID and KSUID are earlier, non-standard takes on the same time-prefix idea; v7 is the official version of that pattern.)

A practical cheat sheet

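Public-facing or security-sensitive IDs (password-reset tokens, share links): you usually want unpredictability. For IDs that are merely exposed, v4 avoids leaking creation time or order; for anything that acts as a credential, use a dedicated cryptographic token, not a UUID at all.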
Database primary keys: prefer v7 for index locality; fall back to auto-increment if you don't need distributed generation.
Never parse meaning out of a UUID. Don't assume v1's timestamp, don't sort v4s expecting order, and don't treat a UUID as a secret just because it looks random.
Storage: store the 128 bits as a UUID/BINARY(16) column, not a 36-char string — you're otherwise spending 36 bytes to hold 16 bytes of data and slowing every index comparison.
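Two of the cheat-sheet points are easy to demonstrate, sketched here with Python's standard library only. The first part compares the text and binary forms of the same UUID; the second shows one way to mint a dedicated token with the secrets module, where the 32-byte length is an arbitrary choice for illustration.

```python
import secrets
import uuid

u = uuid.uuid4()

text_form = str(u)       # what a CHAR(36) column would hold
binary_form = u.bytes    # what a BINARY(16) or native UUID column holds
print(len(text_form), len(binary_form))

# The binary form round-trips losslessly.
assert uuid.UUID(bytes=binary_form) == u

# For credentials, a purpose-built token rather than a UUID.
token = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe text
print(token)
```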

The takeaway

A UUID isn't a random string; it's a 128-bit number with 6 reserved bits and a version that decides everything about how it behaves. v4 is random and collision-proof in practice but hostile to clustered indexes; v7 keeps the coordination-free magic while sorting by time so your database stops fighting you. Pick the version for the job instead of letting the library default decide.

If you just need a few to test with, you can generate UUIDs here — and next time someone on your team says "just use a UUID," you'll know which one they should mean.

Author bio: Quan Nguyen builds free, no-signup developer tools at calculators.im, including a UUID generator and a subnet calculator.