ahmet gedik

Posted on Jul 4

How UUID v7 Primary Keys Cut B-Tree Bloat in Our Video Store

#postgres #uuid #database #performance

We aggregate trending video metadata across a dozen Asia-Pacific regions — Japan, Korea, Taiwan, Singapore, Thailand, Vietnam — and the ingestion pipeline behind TopVideoHub writes a few million rows a day into PostgreSQL. The videos table is the hot path: every fetch cycle upserts titles, channel references, thumbnail URLs, view counts, and CJK-tokenized search fields. For almost two years our primary key was a uuid column populated by gen_random_uuid() — UUID v4, fully random. It worked fine, until it quietly stopped working fine.

The symptom was boring and brutal. INSERT p99 latency crept from 4 ms to roughly 30 ms over six months, even though row-count growth was linear and the hardware never changed. autovacuum ran almost continuously. WAL volume per insert was 3–4× the logical row size. The table itself was about 40 GB, but the primary-key index alone was 18 GB and growing faster than the data underneath it.

This is the textbook UUID v4 failure mode, and it is worth understanding exactly why it happens before reaching for the fix — because the fix is small, but only obvious once you see the mechanism.

Why random UUIDs wreck a B-tree

A PostgreSQL primary key is a B-tree index, and a B-tree stays cheap to maintain when new keys arrive in roughly ascending order. When the next key is always larger than the last, every insert lands on the right-most leaf page. That page stays hot in shared buffers, fills up, splits once, and moves on. Sequential keys give you near-perfect cache locality and minimal page splits.

UUID v4 is the opposite. Each value is 122 bits of randomness, so consecutive inserts scatter uniformly across the entire key space. The consequences compound:

Random leaf-page writes. Every insert targets a different, usually cold, leaf page. On a 18 GB index that means a buffer cache miss and a random read before you can even write.
Constant page splits everywhere. Because inserts hit pages that are already partially full at random positions, you trigger splits across the whole tree instead of just at the right edge. Splits are expensive and they fragment the index.
WAL write amplification via full-page images. The first time a page is touched after a checkpoint, PostgreSQL writes the entire 8 KB page into WAL (a full-page image) for crash safety. Random inserts touch far more distinct pages per checkpoint than sequential inserts, so you pay FPI costs over and over. That was the source of our 3–4× WAL amplification.
Index bloat and dead tuples. Splits plus HOT-update churn leave low-density pages that autovacuum has to chase, which is why vacuum never seemed to catch up.

None of this is a PostgreSQL flaw. It is just what happens when you feed a B-tree keys with no temporal locality. The data we insert does have a natural order — it arrives over time — but UUID v4 throws that ordering away.

UUID v7 keeps the ordering

UUID v7, standardized in RFC 9562 (which obsoleted the old RFC 4122 in 2024), is designed precisely for this. Instead of 122 random bits, it puts a 48-bit Unix millisecond timestamp in the most-significant position, followed by the version/variant bits and 74 bits of randomness:

bits 0–47: Unix epoch milliseconds (big-endian)
bits 48–51: version (0111 = 7)
bits 52–63: 12 random bits (often used for sub-millisecond monotonicity)
bits 64–65: variant (10)
bits 66–127: 62 random bits

Because the timestamp sits in the high bits, UUID v7 values generated over time sort almost perfectly in ascending order. That restores right-edge insert behavior in the B-tree while keeping the property we actually wanted from UUIDs in the first place: globally unique IDs that any worker, region, or edge node can mint without coordinating with the database or each other. For a distributed fetch pipeline pulling from multiple regions concurrently, that decentralized generation is non-negotiable — a BIGSERIAL would force every writer through a shared sequence.

Generating UUID v7 in PHP 8.4

Our ingestion workers run PHP 8.4 behind LiteSpeed. PHP has no built-in v7 generator, but it is about fifteen lines. The only subtlety is byte order: the timestamp must be big-endian so that lexicographic byte ordering matches chronological ordering.

<?php
declare(strict_types=1);

final class Uuid7
{
    /** Returns a canonical lowercase UUIDv7 string. */
    public static function generate(?int $unixMs = null): string
    {
        $unixMs ??= (int) (microtime(true) * 1000);

        // 48-bit timestamp, big-endian, packed into the first 6 bytes.
        $tsHex = str_pad(dechex($unixMs), 12, '0', STR_PAD_LEFT);
        $bytes = hex2bin($tsHex);              // 6 bytes
        $bytes .= random_bytes(10);            // 10 random bytes

        // Set version (7) in byte 6 high nibble.
        $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x70);
        // Set variant (10xx) in byte 8 high bits.
        $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);

        $hex = bin2hex($bytes);
        return sprintf(
            '%s-%s-%s-%s-%s',
            substr($hex, 0, 8),
            substr($hex, 8, 4),
            substr($hex, 12, 4),
            substr($hex, 16, 4),
            substr($hex, 20, 12)
        );
    }
}

// Two IDs minted a few ms apart sort in creation order:
echo Uuid7::generate(), PHP_EOL;

The random_bytes() call is cryptographically secure, so collision risk inside the same millisecond is negligible (74 bits of entropy per ms). If you generate thousands of IDs per millisecond on a single worker and want strict monotonicity even within a tick, you can replace the high random bits with an incrementing counter, but for our volume per worker that is unnecessary.

The schema and the storage decision

The table uses the native uuid type rather than bytea or text. This matters more than people expect. A uuid column is fixed 16 bytes; storing the same value as a 36-char text is 37 bytes and breaks index ordering subtly because it sorts as ASCII, not as bytes. Native uuid stores and compares the 16 raw bytes, which is exactly what we want.

CREATE TABLE videos (
    id           uuid PRIMARY KEY,            -- UUIDv7, minted by the worker
    region       text        NOT NULL,        -- 'JP','KR','TW','SG','TH','VN'...
    source_id    text        NOT NULL,        -- platform-native video id
    title        text        NOT NULL,
    channel_ref  uuid        NOT NULL,
    view_count   bigint      NOT NULL DEFAULT 0,
    fetched_at   timestamptz NOT NULL DEFAULT now(),
    metadata     jsonb       NOT NULL DEFAULT '{}'::jsonb,
    UNIQUE (region, source_id)
);

-- Dedup upsert on the natural key; the PK never collides.
INSERT INTO videos (id, region, source_id, title, channel_ref, view_count)
VALUES ($1, $2, $3, $4, $5, $6)
ON CONFLICT (region, source_id)
DO UPDATE SET view_count = EXCLUDED.view_count,
              title      = EXCLUDED.title,
              fetched_at = now();

Note the separation of concerns: id is the surrogate v7 key that makes the B-tree happy, while (region, source_id) is the business uniqueness constraint that prevents the same upstream video from being stored twice. Foreign keys elsewhere reference id, so we get compact 16-byte joins instead of dragging composite text keys around the schema.

One more PostgreSQL-specific win: because v7 keys are time-ordered, BRIN indexes on id become viable for archival range scans, and pg_repack runs faster because the physical and logical order largely agree.

Measuring it instead of trusting it

I do not deploy index changes on a hot table based on a blog post — including this one. We reproduced both schemes on a staging copy and inserted ten million rows each, measuring insert throughput and final index size. Here is the harness, in Python with psycopg, because it was the fastest way to script a fair comparison.

import time
import os
import psycopg

DSN = os.environ["STAGING_DSN"]
N = 10_000_000
BATCH = 1_000

def bench(table: str, id_sql: str) -> None:
    with psycopg.connect(DSN, autocommit=False) as conn:
        with conn.cursor() as cur:
            cur.execute(f"TRUNCATE {table}")
        conn.commit()

        start = time.perf_counter()
        with conn.cursor() as cur:
            for offset in range(0, N, BATCH):
                rows = [
                    (f"JP", f"vid-{offset + i}", "title")
                    for i in range(BATCH)
                ]
                cur.executemany(
                    f"INSERT INTO {table} (id, region, source_id, title, "
                    f"channel_ref, view_count) VALUES "
                    f"({id_sql}, %s, %s, %s, {id_sql}, 0)",
                    rows,
                )
                if offset % 100_000 == 0:
                    conn.commit()
        conn.commit()
        elapsed = time.perf_counter() - start

        with conn.cursor() as cur:
            cur.execute(
                "SELECT pg_size_pretty(pg_relation_size(%s || '_pkey'))",
                (table,),
            )
            idx_size = cur.fetchone()[0]

    print(f"{table:>14}: {N/elapsed:,.0f} rows/s, pk index {idx_size}")

# v4 path uses gen_random_uuid(); v7 path uses a SQL function we installed.
bench("videos_v4", "gen_random_uuid()")
bench("videos_v7", "uuid_generate_v7()")

The numbers from our run, on the staging box with an 8 GB shared_buffers and a warm checkpoint window:

Insert throughput: v7 sustained ~2.6× the rows/sec of v4 once the index outgrew shared_buffers. Below memory the gap was small; above it, v4 fell off a cliff and v7 stayed flat.
Primary-key index size: v4 finished at 19 GB with ~58% average leaf density; v7 finished at 11 GB with ~91% density. Same rows, nearly half the index.
WAL generated: v7 produced about 40% less WAL for the identical workload, almost entirely from fewer full-page images.

That density difference is the whole story. A denser index means more keys per cached page, which means more of the working set fits in RAM, which means fewer random reads, which is why p99 dropped.

You get the creation timestamp for free

A quiet bonus: because the timestamp lives in the high 48 bits, you can extract a row's creation time straight from its primary key without an extra created_at column. This is handy for sharding, retention sweeps, and debugging.

-- Extract the embedded Unix-ms timestamp from a UUIDv7.
CREATE FUNCTION uuid7_time(u uuid) RETURNS timestamptz AS $$
    SELECT to_timestamp(
        ('x' || substring(replace(u::text, '-', '') FROM 1 FOR 12))::bit(48)::bigint
        / 1000.0
    );
$$ LANGUAGE sql IMMUTABLE;

-- Everything ingested in the last hour, ordered by the PK itself:
SELECT id, region, title
FROM videos
WHERE uuid7_time(id) > now() - interval '1 hour'
ORDER BY id DESC
LIMIT 50;

We do still keep fetched_at, because it tracks re-fetch time on upsert rather than first-creation time, and the two diverge for long-lived rows. But for any append-only table, the v7 key alone is enough to range-scan by time.

Minting v7 at the edge in Go

Not all of our writers are PHP. A couple of Go collectors run closer to the upstream APIs in-region, and they mint the same v7 IDs before the row ever reaches PostgreSQL. Keeping the algorithm identical across languages is what lets any service generate a key the database will love. Go's standard library now has it, but the explicit version shows the bit layout clearly:

package main

import (
    "crypto/rand"
    "encoding/hex"
    "fmt"
    "time"
)

func UUIDv7() (string, error) {
    var b [16]byte
    if _, err := rand.Read(b[6:]); err != nil {
        return "", err
    }

    ms := uint64(time.Now().UnixMilli())
    b[0] = byte(ms >> 40)
    b[1] = byte(ms >> 32)
    b[2] = byte(ms >> 24)
    b[3] = byte(ms >> 16)
    b[4] = byte(ms >> 8)
    b[5] = byte(ms)

    b[6] = (b[6] & 0x0f) | 0x70 // version 7
    b[8] = (b[8] & 0x3f) | 0x80 // variant 10

    h := hex.EncodeToString(b[:])
    return fmt.Sprintf("%s-%s-%s-%s-%s",
        h[0:8], h[8:12], h[12:16], h[16:20], h[20:32]), nil
}

func main() {
    id, _ := UUIDv7()
    fmt.Println(id)
}

The byte-shift sequence is the same big-endian 48-bit timestamp as the PHP version. Cross-language consistency here is not cosmetic — if one writer emitted little-endian timestamps, its keys would scatter randomly and reintroduce exactly the bloat we removed.

What actually changed in production

We migrated by adding the new keys to fresh partitions and letting old v4 partitions age out, rather than rewriting 40 GB in place. After a month on v7 keys:

INSERT p99 dropped from ~30 ms back to ~4 ms and stayed there as volume grew.
The primary-key index shrank from 18 GB to roughly 10 GB for comparable row counts.
autovacuum on the table went from near-constant to occasional, freeing IO for the CJK FTS workload that actually serves users.
Cloudflare and our LiteSpeed caches were unaffected — this is purely an origin-database change — but faster ingestion means trending lists refresh with less lag, which is the user-visible payoff.

Caveats worth knowing before you switch

UUID v7 is not free of trade-offs, and pretending otherwise is how you get a different 3 a.m. page:

The creation time is exposed. Anyone holding an ID can read its millisecond timestamp. For internal video metadata that is fine, even useful. For IDs you hand to untrusted clients where creation time is sensitive, this leaks information — keep v4 (or a random opaque token) on those.
Monotonicity is only guaranteed across milliseconds, not within one. Two IDs minted in the same millisecond on different workers can sort in either order. That is fine for B-tree health; it is not a substitute for a real ordering column if you need strict global sequencing.
Clock skew matters. A worker with a clock running fast will mint keys that sort ahead of reality. Run NTP, and never use a wall clock that can jump backward as your sole monotonicity source.
It is still a 16-byte key. If your real problem is that you have too many indexes on a wide row, v7 helps the PK but will not save a poorly chosen secondary index.

Conclusion

The move from UUID v4 to UUID v7 was the highest-leverage database change we made all year, and it was about thirty lines of code plus a partition migration. The lesson generalizes past UUIDs: B-trees reward keys that arrive in time order, and any primary key you can make monotonic — without giving up distributed, coordination-free generation — will pay you back in index density, cache hit rate, and WAL volume. If you are running random UUID primary keys on a high-write table today, benchmark v7 on a staging copy this week. The index-size graph alone will make the decision for you.

DEV Community