DEV Community: Viktor Dorotovič

Partitioning InnoDB tables by time-based pseudo-sequential UUIDs

Viktor Dorotovič — Tue, 11 Oct 2022 22:13:24 +0000

Partitioning has multiple uses – spreading load onto multiple disks, cold storage of older data on cheaper disks, and probably others. Most importantly though, partitions are not for performance.

The main use case I’m going to explain is time-based partitioning as a tool for limiting the scope of stored data. This could be required to comply with a data retention policy or simply to save money on disk space.

A straightforward approach would be periodically issuing DELETE queries to get rid of rows that are older than a certain time period:

DELETE FROM temperature WHERE time_of_measurement < '2020-01-01';

This incurs a lot of I/O though. Each row has to be found, then removed from the page, while also updating each index pointing to the row.

If the table was range-partitioned by a time column, whole chunks of rows could be removed in one go:

ALTER TABLE temperature DROP PARTITION p_2019_11, p_2019_12;

Dropping a partition is much faster than deleting individual rows. Each partition is essentially a separate table, so when it is dropped, the space occupied by the table can just be marked as unused, instead of updating data structures row-by-row.
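As a rough sketch of how such a retention job could pick the partitions to drop, the snippet below builds the ALTER TABLE statement from a list of partition names. It assumes monthly partitions named p_YYYY_MM, as in the statement above; the helper names partitions_to_drop and drop_statement are made up for illustration and no database connection is involved.

```python
import datetime


def partitions_to_drop(partition_names, cutoff):
    """Return the monthly partitions (named p_YYYY_MM) that end on or before cutoff."""
    expired = []
    for name in partition_names:
        parts = name.split("_")
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue  # skip partitions that are not named after a month
        year, month = int(parts[1]), int(parts[2])
        # The first day of the following month is the exclusive upper bound.
        upper_bound = datetime.date(year + month // 12, month % 12 + 1, 1)
        if upper_bound <= cutoff:
            expired.append(name)
    return expired


def drop_statement(table, partition_names, cutoff):
    """Build the ALTER TABLE statement, or return None if nothing has expired."""
    expired = partitions_to_drop(partition_names, cutoff)
    if not expired:
        return None
    return "ALTER TABLE %s DROP PARTITION %s;" % (table, ", ".join(expired))


names = ["p_2019_11", "p_2019_12", "p_2020_01", "p_future"]
print(drop_statement("temperature", names, datetime.date(2020, 1, 1)))
# ALTER TABLE temperature DROP PARTITION p_2019_11, p_2019_12;
```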

Case in point

As an example, I’m going to use a toy database for storing information about workshops and participants in IT conferences. There are going to be 2 entity classes, with a many-to-many relationship between them:

Workshop
Participant

A workshop can be attended by multiple participants, and a participant can attend multiple workshops. For simplicity, the fact that a participant is not able to attend two workshops at the same time will not be checked.

There are different data retention requirements – workshops will be stored for 5 years for historical reference, while participant records can be deleted after 2 years.

Apart from listing the individual members of each entity class, there are 2 read operations that the setup will need to support:

Get all participants attending a given workshop
Get all workshops that a given participant has signed up for

I ran the first attempt at defining the table schema:

CREATE TABLE workshops (
  workshop_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  time_scheduled DATETIME NOT NULL
)
PARTITION BY RANGE COLUMNS(time_scheduled) (
  PARTITION p_default VALUES LESS THAN MAXVALUE
);

CREATE TABLE participants (
  participant_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  time_created DATETIME NOT NULL
)
PARTITION BY RANGE COLUMNS(time_created) (
  PARTITION p_default VALUES LESS THAN MAXVALUE
);

CREATE TABLE workshop_participants (
  workshop_participant_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  workshop_id BIGINT NOT NULL,
  participant_id BIGINT NOT NULL,  
  time_created DATETIME NOT NULL
)
PARTITION BY RANGE COLUMNS(time_created) (
  PARTITION p_default VALUES LESS THAN MAXVALUE
);

Uh-oh, MySQL is throwing an error:

A PRIMARY KEY must include all columns in the table's partitioning function (prefixed columns are not considered).

Apparently, this limitation applies to every unique index, not just the primary key, as described in MySQL’s documentation: Partitioning Keys, Primary Keys, and Unique Keys. I have not found any MySQL resource explaining why this is necessary though. PostgreSQL has the same restriction and explains it as follows: “This limitation exists because the individual indexes making up the constraint can only directly enforce uniqueness within their own partitions; therefore, the partition structure itself must guarantee that there are no duplicates in different partitions.”
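The reasoning in the quote can be illustrated with a toy model. This is only a sketch, not how MySQL or PostgreSQL are implemented: the PartitionedTable class is hypothetical and each partition keeps its own set of keys as a stand-in for a local unique index.

```python
import bisect


class PartitionedTable:
    """Toy range-partitioned table: uniqueness is checked per partition only."""

    def __init__(self, upper_bounds, unique_key, partition_key):
        self.upper_bounds = upper_bounds    # sorted, exclusive upper bounds
        self.unique_key = unique_key        # tuple of column names
        self.partition_key = partition_key  # name of the partitioning column
        # One local "unique index" per partition, plus a catch-all at the end.
        self.indexes = [set() for _ in range(len(upper_bounds) + 1)]

    def insert(self, row):
        part = bisect.bisect_right(self.upper_bounds, row[self.partition_key])
        key = tuple(row[column] for column in self.unique_key)
        if key in self.indexes[part]:
            raise ValueError("duplicate key %r in partition %d" % (key, part))
        self.indexes[part].add(key)


# The unique key (id) does not contain the partitioning column:
# the same id lands in two partitions and the duplicate goes unnoticed.
loose = PartitionedTable(["2020-02-01"], ("id",), "created")
loose.insert({"id": 1, "created": "2020-01-15"})
loose.insert({"id": 1, "created": "2020-02-15"})

# The unique key (id, created) contains it: equal keys always meet
# in the same partition, so the local check is sufficient.
strict = PartitionedTable(["2020-02-01"], ("id", "created"), "created")
strict.insert({"id": 1, "created": "2020-01-15"})
try:
    strict.insert({"id": 1, "created": "2020-01-15"})
except ValueError as error:
    print(error)
```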

Working example #1, but not particularly good

To get around this limitation, the primary keys can be defined as compound keys over the autoincremented ID and the time column used for partitioning (time_scheduled or time_created):

CREATE TABLE workshops (
  workshop_id BIGINT NOT NULL AUTO_INCREMENT,
  name VARCHAR(255) NOT NULL,
  time_scheduled DATETIME NOT NULL,
  PRIMARY KEY (workshop_id, time_scheduled)
)
PARTITION BY RANGE COLUMNS(time_scheduled) (
  PARTITION p_future VALUES LESS THAN MAXVALUE
);

CREATE TABLE participants (
  participant_id BIGINT NOT NULL AUTO_INCREMENT,
  name VARCHAR(255) NOT NULL,
  time_created DATETIME NOT NULL,
  PRIMARY KEY (participant_id, time_created)
)
PARTITION BY RANGE COLUMNS(time_created) (
  PARTITION p_future VALUES LESS THAN MAXVALUE
);

CREATE TABLE workshop_participants (
  workshop_participant_id BIGINT NOT NULL AUTO_INCREMENT,
  workshop_id BIGINT NOT NULL,
  participant_id BIGINT NOT NULL,  
  time_created DATETIME NOT NULL,
  PRIMARY KEY (workshop_participant_id, time_created),
  INDEX wp_i_participant_id (participant_id),
  INDEX wp_i_workshop_id (workshop_id)
)
PARTITION BY RANGE COLUMNS(time_created) (
  PARTITION p_future VALUES LESS THAN MAXVALUE
);

The 2 required read operations will be served by these SELECT queries:

-- Get all participants attending a given workshop
SELECT p.participant_id, p.name, p.time_created 
FROM workshop_participants wp
JOIN participants p ON p.participant_id = wp.participant_id
WHERE wp.workshop_id = :workshop_id: ;


-- Get all workshops that a given participant has signed up for
SELECT w.workshop_id, w.name, w.time_scheduled
FROM workshop_participants wp
JOIN workshops w ON w.workshop_id = wp.workshop_id
WHERE wp.participant_id = :participant_id: ;

There are a couple of problems with this approach. Firstly, workshop_id and participant_id are being used as if they were unique, which is not actually guaranteed by the compound keys. They are only unique as long as the autoincrement default value is never overridden.

Secondly, partition pruning cannot be used. Partition pruning is an optimizer feature that uses existing information contained in the query to establish a subset of partitions that is guaranteed to contain all the data that’s being looked for. To put it another way, it prunes partitions that cannot contain any matching rows based on the WHERE condition.
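Conceptually, pruning a range-partitioned table boils down to a lookup in the sorted list of partition boundaries. The sketch below is a simplified model and not MySQL's optimizer; the partition names and boundaries are hypothetical monthly partitions, and prune is a made-up helper.

```python
import bisect

PARTITIONS = ["p_old", "p_2020_01", "p_2020_02", "p_future"]
# Exclusive upper bounds (VALUES LESS THAN); p_future is unbounded.
UPPER_BOUNDS = ["2020-01-01", "2020-02-01", "2020-03-01"]


def prune(values=None):
    """Return the partitions that may hold rows whose partitioning column
    equals one of `values`. Without a condition on the partitioning column
    (values is None), no partition can be ruled out."""
    if values is None:
        return list(PARTITIONS)
    hits = {bisect.bisect_right(UPPER_BOUNDS, value) for value in values}
    return [PARTITIONS[index] for index in sorted(hits)]


# A condition on the partitioning column narrows the scan down:
print(prune(["2020-01-18 10:11:00", "2020-01-18 11:32:00"]))  # ['p_2020_01']

# A condition on some other column (for example only the ID) does not:
print(prune())  # ['p_old', 'p_2020_01', 'p_2020_02', 'p_future']
```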

In this case, I can validate this by creating the partitions p_2020_01, p_2020_02, adding some sample data and looking at the output of EXPLAIN:

ALTER TABLE workshops
REORGANIZE PARTITION p_future INTO (
        PARTITION p_old VALUES LESS THAN ('2020-01-01'),
        PARTITION p_2020_01 VALUES LESS THAN ('2020-02-01'),
        PARTITION p_2020_02 VALUES LESS THAN ('2020-03-01'),
        PARTITION p_future VALUES LESS THAN MAXVALUE
);

INSERT INTO workshops (workshop_id, name, time_scheduled) VALUES
(1, 'UUIDs - when and how to use them', '2020-01-18 10:11'),
(2, 'Partitioning InnoDB tables', '2020-01-18 11:32'),
(3, 'Denormalizing database schemas for performance', '2020-02-05 14:30');

INSERT INTO participants (participant_id, name, time_created) VALUES 
(1, 'john Doe', '2020-01-01 15:00');

INSERT INTO workshop_participants (workshop_id, participant_id, time_created) VALUES
(1, 1, '2020-01-06 10:02'),
(2, 1, '2020-01-06 10:05');

SELECT wp.workshop_id FROM workshop_participants AS wp WHERE wp.participant_id=1;

-- Output:
-- | workshop_id |
-- | ----------- |
-- | 1           |
-- | 2           |

-- for MySQL older than 5.7, use EXPLAIN PARTITIONS
EXPLAIN SELECT w.workshop_id, w.name, w.time_scheduled FROM workshops AS w WHERE w.workshop_id IN (1, 2);

-- Output:
-- | id  | select_type | table | partitions                         | type  | possible_keys | key     | ...
-- | --- | ----------- | ----- | ---------------------------------- | ----- | ------------- | ------- | ...
-- | 1   | SIMPLE      | w     | p_old,p_2020_01,p_2020_02,p_future | range | PRIMARY       | PRIMARY | ...



-- Try it yourself with DB Fiddle: https://www.db-fiddle.com/f/uJbWPwS5dU9XNUCQPBMuNs/0

The EXPLAIN output shows that all of the defined partitions need to be visited when evaluating the SELECT query, even though all the fetched rows are stored in a single partition.

Note that instead of using a JOIN or a WHERE IN (SELECT ...) condition, the lookup is split into 2 queries. The first one fetches a list of IDs from the bridging table, and the second uses a constant IN(…) clause. At the time of writing, it’s not clear to me whether partition pruning in InnoDB can work with values obtained dynamically during query execution; I’ve not seen it mentioned anywhere. PostgreSQL supports run-time pruning.

Working example #2, getting closer

The proper solution would be using the whole compound keys as foreign keys in the bridging table:

CREATE TABLE workshop_participants (
  workshop_participant_id BIGINT NOT NULL AUTO_INCREMENT,
  workshop_id BIGINT NOT NULL,
  workshop_time_scheduled DATETIME NOT NULL,
  participant_id BIGINT NOT NULL,  
  participant_time_created DATETIME NOT NULL,
  time_created DATETIME NOT NULL,
  PRIMARY KEY (workshop_participant_id, time_created),
  INDEX wp_i_workshop (workshop_id, workshop_time_scheduled),
  INDEX wp_i_participant (participant_id, participant_time_created)
)
PARTITION BY RANGE COLUMNS(time_created) (
  PARTITION p_future VALUES LESS THAN MAXVALUE
);

SELECT wp.workshop_time_scheduled, wp.workshop_id FROM workshop_participants AS wp WHERE wp.participant_id=1;

-- Output:
-- | workshop_time_scheduled | workshop_id |
-- | ----------------------- | ----------- |
-- | 2020-01-18 10:11:00     | 1           |
-- | 2020-01-18 11:32:00     | 2           |

EXPLAIN SELECT w.workshop_id, w.name, w.time_scheduled
FROM workshops AS w
WHERE (w.time_scheduled, w.workshop_id) IN (
('2020-01-18 10:11:00', 1),
('2020-01-18 11:32:00', 2)
);

-- | id  | select_type | table | partitions | type  | possible_keys | key     | key_len | ref | rows | filtered | Extra       |
-- | --- | ----------- | ----- | ---------- | ----- | ------------- | ------- | ------- | --- | ---- | -------- | ----------- |
-- | 1   | SIMPLE      | w     | p_2020_01  | range | PRIMARY       | PRIMARY | 13      |     | 2    | 100      | Using where |



-- Try it in DB Fiddle: https://www.db-fiddle.com/f/vQCCQFcqTj9dndNwPDmqVT/0

Now that the WHERE ... IN pairs start with the value for the partitioning column, the query optimizer can remove non-matching partitions from the plan. This is quite unwieldy though and may be especially problematic when combined with an object-relational mapper that may not support compound primary keys.

A better solution — using UUID7

Apart from other UUID advantages over AUTO_INCREMENT, UUID7 is prefixed by the UNIX timestamp, which means that the lexicographic ordering of UUID7 values is the same as the order in which the IDs were generated in time. To get a better picture of the structure of UUID7, take a look at my previous post about UUID performance in indices. The timestamp prefix makes it possible to define discrete ranges of UUID7 values the same way time ranges would be defined. Be aware though, that UUID version 7 is defined as an Internet Draft at the time of writing, and may change.

When comparing a time to a UUID7, equality comparisons cannot be used, because the random part of UUID7 may be any value. Only less-than or greater-than relations are useful.

I made a function that can create a UUID7 from a custom datetime object and fills in the rest of the bytes with either 0x00 or 0xff depending on whether the value is going to be used as a lower or upper boundary:

import datetime
from uuid6 import UUID, uuid7

def uuid7_time_boundary(dt, lower=True):
    timestamp_ms = int(dt.timestamp() * 1000)
    uuid_int = (timestamp_ms & 0xFFFFFFFFFFFF) << 80
    fill_byte = "00" if lower else "ff"
    uuid_int = uuid_int | int(fill_byte * 10, 16)
    # using the UUID constructor from uuid6 library, which automatically populates the version and variant fields
    return UUID(int=uuid_int, version=7)

The uuid6 library in (alpha) version 2022.6.25 implements both uuid6 and uuid7 from the draft. It doesn’t match draft 4 exactly though: it stores the submillisecond parts of the timestamp at the beginning of what’s supposed to be pseudorandom bits. The boundary values for comparison are only computed with 1-millisecond precision, so they would work with either implementation.
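Going in the opposite direction, the timestamp can be read back from the first 48 bits of a UUID7. The following is a minimal sketch using only the standard library; uuid7_timestamp is a hypothetical helper, and it assumes the draft layout in which the leading 48 bits hold the Unix time in milliseconds.

```python
import datetime
import uuid

EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def uuid7_timestamp(value):
    """Return the UTC datetime stored in the first 48 bits of a UUID7."""
    timestamp_ms = value.int >> 80  # drop the lower 80 bits (version, variant, random)
    return EPOCH + datetime.timedelta(milliseconds=timestamp_ms)


sample = uuid.UUID("018ea382-283f-7bfc-addb-a6c06bf382da")
print(uuid7_timestamp(sample).date())  # 2024-04-03
```

This is handy for checking which partition a given ID belongs to, because only the timestamp prefix matters for the range comparison.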

UUID7 < time

To find UUID7s generated before 2022-04-03 10:30:00, we can just build a boundary UUID from this date and compare UUID objects directly:

dt = datetime.datetime.strptime("2022-04-03 10:30:00", "%Y-%m-%d %H:%M:%S").replace(tzinfo=datetime.timezone.utc)

####################################################################################################################
# Using a lower boundary with <:
####################################################################################################################
boundary_uuid = uuid7_time_boundary(dt, lower=True)

# the largest possible UUID7 from the millisecond just before the boundary
earlier_uuid = uuid7_time_boundary(dt - datetime.timedelta(milliseconds=1), lower=False)

# compare with earlier uuid
print("%s\n<\n%s\n== %s\n" % (earlier_uuid, boundary_uuid, earlier_uuid < boundary_uuid))
# 017feef9-743f-7fff-bfff-ffffffffffff
# <
# 017feef9-7440-7000-8000-000000000000
# == True

later_uuid = UUID('018ea382-283f-7bfc-addb-a6c06bf382da')  # 2024-04-03

# compare with later uuid
print("%s\n<\n%s\n== %s\n" % (later_uuid, boundary_uuid, later_uuid < boundary_uuid))
# 018ea382-283f-7bfc-addb-a6c06bf382da
# <
# 017feef9-7440-7000-8000-000000000000
# == False

####################################################################################################################
# Using an upper boundary with <=:
####################################################################################################################

upper_boundary_uuid = uuid7_time_boundary(dt - datetime.timedelta(milliseconds=1), lower=False)

print("%s\n<=\n%s\n== %s\n" % (earlier_uuid, upper_boundary_uuid, earlier_uuid <= upper_boundary_uuid))
# 017feef9-743f-7fff-bfff-ffffffffffff
# <=
# 017feef9-743f-7fff-bfff-ffffffffffff
# == True

print("%s\n<=\n%s\n== %s\n" % (later_uuid, upper_boundary_uuid, later_uuid <= upper_boundary_uuid))
# 018ea382-283f-7bfc-addb-a6c06bf382da
# <=
# 017feef9-743f-7fff-bfff-ffffffffffff
# == False

Partition boundaries

In the world of MySQL, the partitions’ limits can be defined by using hexadecimal constants starting with 0x. This is valid SQL:

CREATE TABLE workshops (
  workshop_id BINARY(16) NOT NULL, -- UUIDs will be stored in compact binary form
  name VARCHAR(255) NOT NULL,
  time_scheduled DATETIME NOT NULL,
  PRIMARY KEY (workshop_id, time_scheduled)
)
PARTITION BY RANGE COLUMNS(workshop_id) (
  -- monthly partitions
  -- for the LESS THAN value we're using the boundary UUID encoded as a hex constant (with dashes removed)
  PARTITION p_old     VALUES LESS THAN (0x016f5e66e80070008000000000000000), -- < 2020-01-01 00:00:00
  PARTITION p_2020_01 VALUES LESS THAN (0x016ffe0c0c0070008000000000000000), -- < 2020-02-01 00:00:00
  PARTITION p_2020_02 VALUES LESS THAN (0x01709364780070008000000000000000), -- < 2020-03-01 00:00:00  
  PARTITION p_future  VALUES LESS THAN MAXVALUE
);

CREATE TABLE participants (
  participant_id BINARY(16) NOT NULL,
  name VARCHAR(255) NOT NULL,
  time_created DATETIME NOT NULL,
  PRIMARY KEY (participant_id, time_created)
)
PARTITION BY RANGE COLUMNS(participant_id) (
  PARTITION p_old     VALUES LESS THAN (0x016f5e66e80070008000000000000000), -- < 2020-01-01 00:00:00
  PARTITION p_2020_01 VALUES LESS THAN (0x016ffe0c0c0070008000000000000000), -- < 2020-02-01 00:00:00
  PARTITION p_2020_02 VALUES LESS THAN (0x01709364780070008000000000000000), -- < 2020-03-01 00:00:00  
  PARTITION p_future  VALUES LESS THAN MAXVALUE
);

CREATE TABLE workshop_participants (
  workshop_participant_id BINARY(16) NOT NULL,
  workshop_id BINARY(16) NOT NULL,
  participant_id BINARY(16) NOT NULL,  
  PRIMARY KEY (workshop_participant_id),
  INDEX wp_i_workshop_id (workshop_id),
  INDEX wp_i_participant_id (participant_id)
)
PARTITION BY RANGE COLUMNS(workshop_participant_id) (
  PARTITION p_old     VALUES LESS THAN (0x016f5e66e80070008000000000000000), -- < 2020-01-01 00:00:00
  PARTITION p_2020_01 VALUES LESS THAN (0x016ffe0c0c0070008000000000000000), -- < 2020-02-01 00:00:00
  PARTITION p_2020_02 VALUES LESS THAN (0x01709364780070008000000000000000), -- < 2020-03-01 00:00:00  
  PARTITION p_future  VALUES LESS THAN MAXVALUE
);
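The hex constants in these definitions do not have to be written by hand. Below is a sketch that generates monthly partition clauses with the standard library only, so it does not depend on the uuid6 package; the helper names uuid7_lower_boundary_hex and monthly_partitions are made up, and the boundaries are assumed to be in UTC.

```python
import datetime

UTC = datetime.timezone.utc
EPOCH = datetime.datetime(1970, 1, 1, tzinfo=UTC)


def uuid7_lower_boundary_hex(dt):
    """Smallest UUID7 for the millisecond of dt, as a MySQL hex literal."""
    timestamp_ms = (dt - EPOCH) // datetime.timedelta(milliseconds=1)
    value = (timestamp_ms & 0xFFFFFFFFFFFF) << 80
    value |= 0x7 << 76   # version 7
    value |= 0b10 << 62  # RFC 4122 variant
    return "0x%032x" % value


def monthly_partitions(year, first_month, count):
    """Build `count` monthly PARTITION clauses starting at year/first_month."""
    clauses = []
    for offset in range(count):
        index = first_month - 1 + offset
        start = datetime.datetime(year + index // 12, index % 12 + 1, 1, tzinfo=UTC)
        end = datetime.datetime(year + (index + 1) // 12, (index + 1) % 12 + 1, 1, tzinfo=UTC)
        clauses.append("PARTITION p_%04d_%02d VALUES LESS THAN (%s)"
                       % (start.year, start.month, uuid7_lower_boundary_hex(end)))
    return clauses


print(",\n".join(monthly_partitions(2020, 1, 2)))
# PARTITION p_2020_01 VALUES LESS THAN (0x016ffe0c0c0070008000000000000000),
# PARTITION p_2020_02 VALUES LESS THAN (0x01709364780070008000000000000000)
```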

Splitting the many-to-many bridging table

In its current form, the bridging table is not great: selecting from workshop_participants either by workshop_id or by participant_id would not use partition pruning, because neither of these foreign keys is part of the partitioning expression. As a solution, the bridging table can be split into two:

CREATE TABLE workshop_participants (
  workshop_id BINARY(16) NOT NULL,
  participant_id BINARY(16) NOT NULL,
  time_created DATETIME NOT NULL,
  PRIMARY KEY (workshop_id, participant_id)
)
-- partitioning condition is satisfied - the partitioning column is part of the primary key
PARTITION BY RANGE COLUMNS(workshop_id) (
  -- ...  
  PARTITION p_future  VALUES LESS THAN MAXVALUE
);

CREATE TABLE participant_workshops (
  participant_id BINARY(16) NOT NULL,
  workshop_id BINARY(16) NOT NULL,
  time_created DATETIME NOT NULL,
  PRIMARY KEY (participant_id, workshop_id)
)
-- partitioning condition is satisfied - the partitioning column is part of the primary key
PARTITION BY RANGE COLUMNS(participant_id) (
  -- ...  
  PARTITION p_future  VALUES LESS THAN MAXVALUE
);

This technique also solves the problem of when partitions are going to be purged from the bridging table if the 2 connected tables have different data retention times. If one side is going to be deleted earlier than the other, there are going to be dangling rows in the other table of the split, but that’s bearable. Since actual foreign key constraints are not supported on partitioned tables, the application logic must be written in such a way as to deal with the possibility of missing rows.

As a side effect, the split tables will also probably take up less disk space than the original combined one, because instead of 3 indexes — 1 clustered and two secondary indexes for the foreign keys, there are now only 2 clustered indexes — 1 for each table. I haven’t checked it though.

The SELECTs for the required read operations also need to be updated. Selecting all participants attending a given workshop will be accomplished as follows:

SELECT HEX(participant_id) FROM workshop_participants WHERE workshop_id = 0x016fb86c8f807d1d9a8960a15b7eb63e;

-- Output:
-- | HEX(participant_id)              |
-- | -------------------------------- |
-- | 016D87D5D180713484C5D4E471CCA418 |
-- | 016F619EE18071D085FD3968AA46EF0B |

EXPLAIN SELECT HEX(participant_id) FROM workshop_participants WHERE workshop_id = 0x016fb86c8f807d1d9a8960a15b7eb63e;
-- Explain:
-- | id  | select_type | table                 | partitions | type | possible_keys | key     | key_len | ref   | rows | filtered | Extra                    |
-- | --- | ----------- | --------------------- | ---------- | ---- | ------------- | ------- | ------- | ----- | ---- | -------- | ------------------------ |
-- | 1   | SIMPLE      | workshop_participants | p_2020_01  | ref  | PRIMARY       | PRIMARY | 16      | const | 2    | 100      | Using where; Using index |

SELECT HEX(participant_id), name, time_created FROM participants WHERE participant_id IN (0x016D87D5D180713484C5D4E471CCA418, 0x016F619EE18071D085FD3968AA46EF0B);

-- Output:
-- | HEX(participant_id)              | name     | time_created        |
-- | -------------------------------- | -------- | ------------------- |
-- | 016D87D5D180713484C5D4E471CCA418 | Jane Doe | 2019-10-01 15:00:00 |
-- | 016F619EE18071D085FD3968AA46EF0B | john Doe | 2020-01-01 15:00:00 |

EXPLAIN SELECT HEX(participant_id), name, time_created FROM participants WHERE participant_id IN (0x016D87D5D180713484C5D4E471CCA418, 0x016F619EE18071D085FD3968AA46EF0B);
-- Explain
-- | id  | select_type | table        | partitions      | type  | possible_keys | key     | key_len | ref | rows | filtered | Extra       |
-- | --- | ----------- | ------------ | --------------- | ----- | ------------- | ------- | ------- | --- | ---- | -------- | ----------- |
-- | 1   | SIMPLE      | participants | p_old,p_2020_01 | range | PRIMARY       | PRIMARY | 16      |     | 2    | 100      | Using where |

To get all workshops that a given participant has signed up for, the other table of the split many-to-many bridge is accessed:

SELECT HEX(workshop_id) FROM participant_workshops WHERE participant_id = 0x016f619ee18071d085fd3968aa46ef0b;

-- Output:
-- | HEX(workshop_id)                 |
-- | -------------------------------- |
-- | 016FB82267207367A7E6E580C4871E32 |
-- | 016FB86C8F807D1D9A8960A15B7EB63E |

EXPLAIN SELECT HEX(workshop_id) FROM participant_workshops WHERE participant_id = 0x016f619ee18071d085fd3968aa46ef0b;

-- Explain:
-- | id  | select_type | table                 | partitions | type | possible_keys | key     | key_len | ref   | rows | filtered | Extra                    |
-- | --- | ----------- | --------------------- | ---------- | ---- | ------------- | ------- | ------- | ----- | ---- | -------- | ------------------------ |
-- | 1   | SIMPLE      | participant_workshops | p_2020_01  | ref  | PRIMARY       | PRIMARY | 16      | const | 2    | 100      | Using where; Using index |

SELECT HEX(workshop_id), name, time_scheduled FROM workshops WHERE workshop_id IN (0x016FB82267207367A7E6E580C4871E32, 0x016FB86C8F807D1D9A8960A15B7EB63E);

-- Output:
-- | HEX(workshop_id)                 | name                             | time_scheduled      |
-- | -------------------------------- | -------------------------------- | ------------------- |
-- | 016FB82267207367A7E6E580C4871E32 | UUIDs - when and how to use them | 2020-01-18 10:11:00 |
-- | 016FB86C8F807D1D9A8960A15B7EB63E | Partitioning InnoDB tables       | 2020-01-18 11:32:00 |

EXPLAIN SELECT workshop_id, name, time_scheduled FROM workshops WHERE workshop_id IN (0x016FB82267207367A7E6E580C4871E32, 0x016FB86C8F807D1D9A8960A15B7EB63E);

-- Explain:
-- | id  | select_type | table     | partitions | type  | possible_keys | key     | key_len | ref | rows | filtered | Extra       |
-- | --- | ----------- | --------- | ---------- | ----- | ------------- | ------- | ------- | --- | ---- | -------- | ----------- |
-- | 1   | SIMPLE      | workshops | p_2020_01  | range | PRIMARY       | PRIMARY | 16      |     | 2    | 100      | Using where |

All of the above queries take advantage of partition pruning.

The sample data and all queries are available for playing around with at DB Fiddle, or as a downloadable .sql file.

Hiccups

Listing partitions in INFORMATION_SCHEMA.PARTITIONS

The read-only view can be used to list all partitions of a given table. For range partitions, the boundary value is included in the PARTITION_DESCRIPTION field. This works fine for datetime fields but binary values are garbled. If you need to read back this value, it’s better to manually encode it in the partition’s name.

SELECT PARTITION_NAME, PARTITION_ORDINAL_POSITION, PARTITION_DESCRIPTION FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME = 'workshops';

-- Output MySQL 5.7:
-- | PARTITION_NAME | PARTITION_ORDINAL_POSITION | PARTITION_DESCRIPTION |
-- | -------------- | -------------------------- | --------------------- |
-- | p_old          | 1                          | 'o^f                 |
-- | p_2020_01      | 2                          | 'o                   |
-- | p_2020_02      | 3                          | 'p                   |
-- | p_future       | 4                          | MAXVALUE              |

Update: It’s been fixed in MySQL 8.0:

SELECT PARTITION_NAME, PARTITION_ORDINAL_POSITION, PARTITION_DESCRIPTION FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME = 'workshops';
-- Output MySQL 8.0:

-- | PARTITION_NAME | PARTITION_ORDINAL_POSITION | PARTITION_DESCRIPTION                      |
-- | -------------- | -------------------------- | ------------------------------------------ |
-- | p_future       | 4                          | MAXVALUE                                   |
-- | p_2020_02      | 3                          | _binary 0x01709364780070008000000000000000 |
-- | p_2020_01      | 2                          | _binary 0x016FFE0C0C0070008000000000000000 |
-- | p_old          | 1                          | _binary 0x016F5E66E80070008000000000000000 |

ORDER BY partitioning_col LIMIT N

A typical query for this would be “Get me the latest 10 items”. When the table is partitioned by time, it would make sense to use the implicit ordering of partitions and start from the latest partition. If fewer than 10 items are found, continue onto the next partition, and so on. Once 10 items have been found, the query can finish immediately. AFAIK this is not implemented in MySQL; instead, it will access all partitions and do a merge sort. PostgreSQL has an improved query planner and supports ordered partition scans since version 12.
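The idea of an ordered partition scan can be sketched in a few lines. This is only an illustration of the strategy described above, not the behaviour of any particular engine; partitions are modelled as plain lists of keys and latest_items is a hypothetical helper.

```python
def latest_items(partitions, limit):
    """Return the `limit` largest keys and the number of partitions visited.

    `partitions` is ordered from the oldest to the newest range; the keys
    inside a partition may be in any order.
    """
    result = []
    visited = 0
    for rows in reversed(partitions):  # start with the newest partition
        visited += 1
        remaining = limit - len(result)
        result.extend(sorted(rows, reverse=True)[:remaining])
        if len(result) >= limit:
            break  # older partitions cannot contain anything newer
    return result, visited


partitions = [[1, 3, 2], [6, 4, 5], [9, 7, 8]]
print(latest_items(partitions, 4))  # ([9, 8, 7, 6], 2)
```

Only two of the three partitions are touched here, because the ranges do not overlap and the scan can stop as soon as enough rows have been collected.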

Other DB Engines

I haven’t tested this on DB engines other than MySQL (with InnoDB storage), but the basic principle should work. It all depends on whether binary sequences can be used as partition boundaries. I’d be curious to know if you’re using a similar approach, as well as what works and what doesn’t.

UUIDs Are Bad for Database Index Performance, enter UUID7!

Viktor Dorotovič — Fri, 07 Oct 2022 22:10:14 +0000

UUIDs, Universally Unique Identifiers, are a specific form of identifier designed to be unique even when generated on multiple machines. Compared to the autoincremented sequential identifiers commonly used in relational databases, generating them does not require centralized storage of the current state, i.e., the identifiers that have already been allocated. This is useful when a centralized system would pose a performance bottleneck or a single point of failure. UUIDs are designed to be able to support very high allocation rates, up to 10 million per second per machine.

Despite the fact that some types (e.g., UUID4) are not guaranteed to be unique by their method of generation, the chance of generating two conflicting UUIDs is very low. This is also due to the fact that UUIDs are 128 bits long.
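To put a number on "very low", the usual birthday approximation can be applied. The sketch below assumes that all 122 random bits of a UUID4 are uniformly random (the remaining 6 bits are fixed by the version and variant fields); uuid4_collision_probability is a made-up helper name.

```python
import math


def uuid4_collision_probability(count):
    """Approximate probability of at least one collision among `count`
    random UUID4 values, using the birthday bound 1 - exp(-n(n-1)/(2N))."""
    space = 2 ** 122  # number of distinct UUID4 values
    return -math.expm1(-count * (count - 1) / (2 * space))


# Even with a trillion generated IDs, the estimate is on the order of 1e-13.
print(uuid4_collision_probability(10 ** 12))
```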

UUIDs were formally defined in an Internet Draft in 2002, which was promoted to a Proposed Standard as RFC 4122 in 2005. At the time of writing, 17 years later, it still has the status of a proposal.

Representation & structure

UUIDs are just 128-bit numbers, or 16-byte long binary sequences, and can be stored as such for efficiency. In API calls, they are commonly transferred in a text format of hexadecimal sequences separated by dashes in the pattern of 8-4-4-4-12. This mirrors the internal structure as defined by the RFC:

   0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          time_low                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       time_mid                |         time_hi_and_version   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |clk_seq_hi_res |  clk_seq_low  |         node (0-1)            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         node (2-5)                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

UUID versions

Even though the layout looks like a single schema for UUID generation, there are multiple versions of how the individual fields are populated – UUID1 to UUID5. The version is actually encoded in the UUID itself and makes up the most significant 4 bits of the time_hi_and_version field.
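Both the representations and the version field can be inspected with Python's standard uuid module. This is a small sketch; the sample value is arbitrary and version_of is a hypothetical helper that mirrors the description above.

```python
import uuid

u = uuid.UUID("6ccd780c-baba-1026-9564-5b8c656024db")  # arbitrary sample value

print(len(u.bytes))                              # 16
print([len(group) for group in str(u).split("-")])  # [8, 4, 4, 4, 12]
print(uuid.UUID(bytes=u.bytes) == u)             # True


def version_of(value):
    """Read the version from the top 4 bits of time_hi_and_version."""
    return value.time_hi_version >> 12


print(version_of(u))             # 1
print(version_of(uuid.uuid4()))  # 4
```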

UUID1 - time-based

UUID2 – DCE security version

The specification of this version is available at https://pubs.opengroup.org/onlinepubs/9696989899/chap5.htm#tagcjh_08_02_01_01

UUID3 – name-based, MD5

UUID4 – random

UUID5 – name-based, SHA1

UUIDs and B-tree indices

As described in the post comparing random and sequential primary keys, B-tree indexes are well suited for inserting keys when values are increasing and not so much when the values are randomly distributed. Notice something about UUIDs? None of the versions is designed to generate increasing values, where the sort order is similar to how the UUIDs were generated in time. No, not even UUID1. UUID1 is based on time, but the timestamp is split up into multiple segments and they are in the wrong order in the UUID structure. The least significant bits – the bits that have the highest change frequency – are located at the start of the UUID stream, as the time_low field.

UUID1 vs UUID4

Commonly used UUID versions are 1 and 4 because they can be generated “out of thin air”, without requiring custom values (namespace, name). Considering that neither of them is increasing, there should be little difference in insert performance between UUID1 and UUID4.

To test this hypothesis, I extended the test rig developed for comparing random and sequential integer IDs. The script inserts equal-sized chunks of records into a table and tracks the I/O write volume of the DB engine generated by the inserts. The I/O measurement is done by running the DB engine in a Docker container and polling Docker’s stats API.

For this comparison, I’ve used SQLite in clustered mode.

Umm, nope. Hypothesis busted. UUID1 performs much better.

The reason is that while overall, UUID1 isn’t ordered by time, it is increasing in short periods until time_low rolls over and starts from 0 again. The time it takes time_low to roll over is about 7 minutes and 9 seconds. After this period, filled B-tree pages will be revisited and page splitting will occur.

Inserting 1M records during the test took about 33 minutes, so this would mean only 4 rollovers. In order to show how it would perform on a longer timescale, I created a modified version of UUID for which time passes 100× faster – uuid1_fastrollover.
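The rollover period follows directly from the field sizes: the UUID1 timestamp counts 100-nanosecond intervals and time_low holds its lowest 32 bits. A short check of the arithmetic (the variable names are only for illustration):

```python
TICK_SECONDS = 100e-9  # one UUID1 timestamp unit is 100 nanoseconds

rollover_seconds = 2 ** 32 * TICK_SECONDS
minutes, seconds = divmod(rollover_seconds, 60)
print("%d min %.1f s" % (minutes, seconds))  # 7 min 9.5 s

# Complete rollovers during a 33-minute test run
rollovers = int(33 * 60 // rollover_seconds)
print(rollovers)  # 4

# With time passing 100x faster, time_low wraps roughly every 4.3 seconds
print("%.1f s" % (rollover_seconds / 100))  # 4.3 s
```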

Designing efficient UUIDs

What if a new UUID version could be designed that would take the randomness of UUID4 and combine it with a timestamp prefix? This would make the UUID increase overall, but not locally – due to the random postfix. The random part ensures uniqueness when a high generation rate is necessary and also makes the UUIDs hard to predict – it’s not possible to guess the previous, or next UUID. It’s fairly simple to devise a custom UUID scheme, but fortunately, there is a new Internet-Draft (at the time of writing) defining new pseudo-sequential UUID versions that aim to solve exactly this issue: draft-peabody-dispatch-new-uuid-format-04. The current state and progress can be viewed at IETF Datatracker.

UUID 6 – field-compatible version of UUID1

The original field structure is kept, but the timestamp fields are shuffled around so that the whole timestamp is in the correct order – starting with the most significant bits and ending with the least significant ones. This way, consecutively generated UUIDs are always increasing in value.
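The reshuffling can be sketched with the standard uuid module. This is a minimal illustration of the bit layout described in the draft, not the code of any particular library; uuid1_to_uuid6 is a hypothetical helper and the two input values are hand-made UUID1s placed just before and just after a time_low rollover.

```python
import uuid


def uuid1_to_uuid6(u1):
    """Re-pack a UUID1 so its 60-bit timestamp is stored most significant bits first."""
    timestamp = u1.time                 # 60-bit count of 100 ns intervals
    high_48 = timestamp >> 12           # most significant 48 bits go first
    low_12 = timestamp & 0xFFF          # remaining bits share a field with the version
    rest = u1.int & 0xFFFFFFFFFFFFFFFF  # variant, clock sequence and node, unchanged
    return uuid.UUID(int=(high_48 << 80) | (0x6 << 76) | (low_12 << 64) | rest)


before = uuid.UUID("ffffffff-0000-1000-8000-000000000000")  # time_low at its maximum
after = uuid.UUID("00000000-0001-1000-8000-000000000000")   # one tick later

print(before < after)                                  # False
print(uuid1_to_uuid6(before) < uuid1_to_uuid6(after))  # True
```

The UUID1 values sort in the wrong order across the rollover, while the re-packed values keep the order in which they were generated.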

UUID 7 – time-ordered

This is the go-to version for use in newly developed systems, where compatibility with existing UUID1 values is not necessary. Notice that UUID1 uses a Gregorian epoch timestamp, while UUID7 is defined to use the Unix epoch timestamp.
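For illustration, a UUID7-like value can be assembled by hand. The sketch below assumes the layout from the draft (48-bit Unix timestamp in milliseconds, 4-bit version, 12 random bits, 2-bit variant, 62 random bits); uuid7_sketch is a made-up name, and for real use a maintained library is the safer choice.

```python
import os
import time
import uuid


def uuid7_sketch(timestamp_ms=None):
    """Build a UUID7-like value: millisecond timestamp prefix, random remainder."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    random_bits = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = ((timestamp_ms & 0xFFFFFFFFFFFF) << 80) | random_bits
    value = (value & ~(0xF << 76)) | (0x7 << 76)  # overwrite the version bits
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # overwrite the variant bits
    return uuid.UUID(int=value)


older = uuid7_sketch(1_577_836_800_000)  # 2020-01-01 00:00:00 UTC
newer = uuid7_sketch(1_577_836_800_001)  # one millisecond later
print(older < newer)  # True
print(newer.version)  # 7
```

Values generated within the same millisecond are ordered randomly in this sketch, which matches the "increasing overall, but not locally" behaviour described earlier.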

UUID 8 – custom, vendor-specific

I have included this version just for completeness – it is in the draft, but here the ordering will clearly depend on the custom data, so this version is not going to be tested.

Testing & comparison

Implementation notes – which Python library was used for UUID6 and UUID7?

I’ve found a library on PyPI that appears to implement UUID7 – uuid7, but upon further inspection of (the only) version 0.1.0, it does not seem to match the structure defined in the draft at the time of writing.

Confusingly enough, there is a package named uuid6 that implements both UUID6 and UUID7 and seems to follow the draft. I used it for the test.

InnoDB – clustered index

UUID6 and UUID7 perform equally well, as expected.

PostgreSQL – non-clustered index

When using Postgres’s non-clustered index, there is a similar overhead for UUID4 over UUID6/7. The I/O is an order of magnitude lower than for InnoDB though.

Conclusion

When comparing UUID1 and UUID4 from the original RFC, UUID1 performs much better even though the time fields are in the wrong order. This is because the low bits still take about 7 minutes to roll over, and during this time the UUID value is increasing.
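The 7-minute figure follows from the UUID1 layout: the leading field holds the low 32 bits of a timestamp counted in 100-nanosecond steps. A quick check of the arithmetic:

```python
TICK_SECONDS = 100e-9   # UUID1 timestamps count 100 ns intervals
TIME_LOW_BITS = 32      # width of the leading time_low field

rollover_seconds = (2 ** TIME_LOW_BITS) * TICK_SECONDS
rollover_minutes = rollover_seconds / 60
print(f"{rollover_seconds:.1f} s = {rollover_minutes:.2f} min")  # 429.5 s = 7.16 min
```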

UUID6 and UUID7 from the draft take this further and keep the ordering of generated values over the full timeline. Even though the versions are not yet standardized, there are existing libraries, at least for Python, that follow the draft. When designing a new database, UUID7 is the go-to option and would perform well in most database engines.

If you’re working on a system that already generates UUID1 or UUID4 but you can still choose the DB engine, picking one that supports non-clustered indexes, like PostgreSQL, is a viable option. There is a performance hit, but it is much less pronounced than with InnoDB’s clustered index. Microsoft SQL Server also supports both index types.

Alternative solutions & related studies

Universally Unique Lexicographically Sortable Identifier

UUIDs are Popular, but Bad for Performance — Let’s Discuss – Percona Database Performance Blog

Illustrating Primary Key models in InnoDB and their impact on disk usage – Percona Database Performance
I dig the straightforward graphical view of InnoDB page allocation.

MySQL 8 can convert UUID1 to binary and shuffle the time fields to the “correct” order:
UUID_TO_BIN(string_uuid, swap_flag)
BIN_TO_UUID(binary_uuid, swap_flag)
Rick James: GUID/UUID Performance Breakthrough
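For illustration, the effect of swap_flag = 1 can be emulated in Python: the time-low and time-high groups (the first and third groups of hex digits) trade places in the binary value. The function names below are made up for this sketch; inside MySQL the built-in functions should be used.

```python
import uuid


def uuid_to_bin_swapped(text: str) -> bytes:
    """Emulate UUID_TO_BIN(text, 1): swap the time-low and time-high groups."""
    raw = uuid.UUID(text).bytes
    time_low, time_mid, time_high = raw[0:4], raw[4:6], raw[6:8]
    return time_high + time_mid + time_low + raw[8:]


def bin_to_uuid_swapped(blob: bytes) -> str:
    """Emulate BIN_TO_UUID(blob, 1): undo the swap and format as text."""
    time_high, time_mid, time_low = blob[0:2], blob[2:4], blob[4:8]
    return str(uuid.UUID(bytes=time_low + time_mid + time_high + blob[8:]))


original = "6ccd780c-baba-1026-9564-5b8c656024db"
stored = uuid_to_bin_swapped(original)
print(stored.hex())
print(bin_to_uuid_swapped(stored) == original)
```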

Choose the right primary key to save a large amount of disk I/O

Viktor Dorotovič — Wed, 05 Oct 2022 11:22:17 +0000

Imagine you’re working in a large book warehouse and in charge of new arrivals. There’s a separate, digital system for metadata like authors, categories, etc., so the only information you’ll use during storage is the inventory number. Each book is identified by a unique number with many digits and all books must be findable by their number. To make handling quicker, books are packed in boxes, ordered by the inventory number. When looking for a book, the box must be identified first. Therefore, each box is labeled by the lowest inventory number it contains and the first number that’s in the next box.

Your job in the arrivals room is to pick up books-to-be-stored one by one, assign them a new inventory number in the metadata system, label them by number, and put them in a box as mentioned before. Now, the room is quite small and if you run out of space, you’ll need to move the filled boxes into the basement, which might be two floors down.

The way new inventory numbers are created is important here, and we’ll consider two of the many methods to do it.

1. Roll the dice and generate a completely random number

Considering that the inventory number has many digits, the chance of two books getting the same number is minimal, so this method would work fine. After a book is given its number, it needs to be stored in the correct spot in the correct box. However, because the numbers are generated across the whole range, the book likely needs to go into a box that was already removed from the arrivals room. Therefore, the box will need to be fetched from the basement, and the book placed in it. If the box is already full, it will need to be split into two boxes, each of which will now be about half-empty even after placing the new book. The same process is repeated for each subsequent book.

2. Use a counter and just increase the number by 1 for each book

In this case, boxes can just be filled from start to end, and their order is guaranteed. If a box becomes full, the next books are put into another empty box, until it becomes full as well. The process then repeats, and there’s no need to fetch anything from the basement.

Which one of these two methods will allow the handling of more books per day with the same size arrivals room and a single worker? In this case, the one with the sequential numbers, obviously.
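The warehouse can be turned into a toy simulation. The sketch below is a simplified model with made-up numbers, not the test rig used later in the article: boxes never split, the arrivals room is a least-recently-used cache, and a "fetch" is counted every time a box has to be brought into the room.

```python
import random
from collections import OrderedDict


def count_box_fetches(box_ids, room_capacity):
    """Count how often a box must be brought into the room (LRU eviction)."""
    room = OrderedDict()
    fetches = 0
    for box in box_ids:
        if box in room:
            room.move_to_end(box)          # box is already in the room
        else:
            fetches += 1                   # trip to the basement (or a fresh box)
            room[box] = True
            if len(room) > room_capacity:
                room.popitem(last=False)   # carry the least recently used box away
    return fetches


BOOKS = 100_000          # hypothetical workload
BOOKS_PER_BOX = 100
ROOM_CAPACITY = 50       # boxes that fit in the arrivals room
TOTAL_BOXES = BOOKS // BOOKS_PER_BOX

sequential_boxes = [i // BOOKS_PER_BOX for i in range(BOOKS)]
rng = random.Random(42)
random_boxes = [rng.randrange(TOTAL_BOXES) for _ in range(BOOKS)]

sequential_fetches = count_box_fetches(sequential_boxes, ROOM_CAPACITY)
random_fetches = count_box_fetches(random_boxes, ROOM_CAPACITY)
print(sequential_fetches, random_fetches)
```

With sequential numbers every box is touched exactly once, while random numbers force a trip to the basement for most books because only a small fraction of the boxes fits in the room.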

Why are random identifiers even a thing?

Generating random inventory numbers might sound absurd, but there is a security advantage.

Let’s assume that the only information required for fetching a book is its number, and there are multiple clients storing books. If the numbers are sequential, a client might get crafty and try to fetch a book that doesn’t belong to them, by taking their highest inventory number and just adding 1. If that book was not the last one in storage, the misbehaving client is guaranteed to get a book that is not their own. In the random system, a client can still guess inventory numbers, but there’s a low chance of getting a correct hit, because there are large gaps between individual inventory numbers, even when they’re in order (remember that the numbers have many digits). ¹

Another advantage is that random inventory numbers do not give out information about how many books were processed in a given time period.

Indexes in relational databases

The box system as described previously is an analogy for the leaf-level nodes of a B-tree, a data structure commonly used to create lookup indices in relational databases. A B-tree is not limited to one level though. The “boxes” are put into larger boxes with the range also written on them. In this way, not every box needs to be examined when doing a lookup. If there is enough data, the 2nd level can be put into even larger boxes and so on, until there is only a single “root” box that ranges over all of the identifiers in storage.
The problem of inserting records into random places in the leaf nodes still holds though.
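Finding the right box from its label is a binary search over the sorted lowest numbers. A minimal sketch with hypothetical labels; in a multi-level tree the same search is repeated at each level, from the root box down to a leaf box.

```python
from bisect import bisect_right

# Lowest inventory number stored in each box, in order (hypothetical values).
box_labels = [1000, 2500, 4000, 7000]


def find_box(labels, inventory_number):
    """Return the index of the box whose range covers the given number."""
    position = bisect_right(labels, inventory_number) - 1
    if position < 0:
        raise KeyError(f"{inventory_number} is below the lowest label")
    return position


print(find_box(box_labels, 3999))  # 1: the box that starts at 2500
```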

When looking at how storage engines are implemented, there are two ways to organize rows in a table’s data space (tablespace):

1. Non-clustered index

The rows in a table using a non-clustered index are written into the tablespace in the order they are received, so there’s no need to access old(er) pages. If it’s necessary to quickly find a row by an identifier, a separate index structure is built that only contains the index values and a pointer to the location of the row in the main data space.

2. Clustered index

All the data is held inside the structure of the primary index. If the access pattern is random, when the pages are retrieved from the disk, they will mostly contain actual record data and not just index values. Data from existing records is useless when inserting new records, but it still takes up space in the page cache. This means that a smaller portion of the table’s rows fits in the cache, compared to when only the index values are stored in the B-tree. This leads to more cache misses.

While the random access pattern is not efficient for non-clustered (secondary) indices, it is especially inefficient when the index values only make up a small part of the full row and a clustered index is used.

Experiment and measure!

To demonstrate, I’ve written a smallish test rig that supports three database engines: SQLite, MariaDB (MySQL) and PostgreSQL.

The test script fills a table with the following structure:

+--------------+--------------+--------------------------+
|    Column    |     Type     |         Comment          |
+--------------+--------------+--------------------------+
| id           | BIGINT       | Primary key              |
| time_created | VARCHAR (64) | Filler data, unimportant |
| data         | TEXT (10240) | Random data              |
+--------------+--------------+--------------------------+

To minimize the impact of data generation and other processes running on the test machine, the main metric I’ve chosen to monitor is the volume of I/O operations. Compared to a time-based metric, such as sustained inserts per second, it is much more stable.

Disk I/O might not be the bottleneck in all scenarios, though; computing (CPU) power and network connectivity also matter.

I’ve opted to start the database instances inside Docker containers, so the I/O can be measured using Docker’s stats API.
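As an illustration of what can be read from a stats sample, the helper below sums the written bytes from the block I/O section of one decoded stats document. It is a sketch under assumptions: the helper name is made up, the sample values are invented, and whether the test rig reads exactly this field is not stated in the article.

```python
def written_bytes(stats: dict) -> int:
    """Sum the bytes written to block devices in one Docker stats sample."""
    blkio = stats.get("blkio_stats") or {}
    entries = blkio.get("io_service_bytes_recursive") or []
    # The operation name is compared case-insensitively ("Write" or "write").
    return sum(e["value"] for e in entries if e.get("op", "").lower() == "write")


# Invented sample in the shape of a decoded stats document.
sample = {
    "blkio_stats": {
        "io_service_bytes_recursive": [
            {"major": 8, "minor": 0, "op": "Read", "value": 4096},
            {"major": 8, "minor": 0, "op": "Write", "value": 163840},
            {"major": 8, "minor": 16, "op": "Write", "value": 16384},
        ]
    }
}
print(written_bytes(sample))  # 180224
```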

SQLite

The first tests were done using SQLite, because it supports both options. By default, it uses non-clustered indexes; however, a clustered index can be created for the primary key by using the WITHOUT ROWID keyword during table definition.

For each database engine, the page size was set to 16KB, and the page cache size to 10MB.

In the case of SQLite this meant setting the following PRAGMAs:

PRAGMA page_size = 16384; PRAGMA cache_size = -10240;
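A minimal sketch of the same setup from Python's built-in sqlite3 module, creating one table of each kind. The table names and the temporary file are made up for the example; the columns follow the table structure above. The page size has to be set before the first table is created.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "rig.sqlite3")  # throwaway database file
conn = sqlite3.connect(db_path)

# Page size must be set before the database is initialised; cache size is in KiB when negative.
conn.execute("PRAGMA page_size = 16384")
conn.execute("PRAGMA cache_size = -10240")

columns = "id BIGINT PRIMARY KEY, time_created VARCHAR(64), data TEXT"
# Default table: rows live in a rowid-ordered table, the primary key is a separate index.
conn.execute(f"CREATE TABLE record_nonclustered ({columns})")
# WITHOUT ROWID: rows are stored inside the primary key B-tree (clustered).
conn.execute(f"CREATE TABLE record_clustered ({columns}) WITHOUT ROWID")
conn.commit()

page_size = conn.execute("PRAGMA page_size").fetchone()[0]
cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
table_names = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
conn.close()
print(page_size, cache_size, table_names)
```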

Non-clustered index

Clustered index

Whoa!

Both clustered and non-clustered in the same plot for comparison:

Graphing writes per one row inserted

By differentiating the series for write volume, we can calculate the write volume needed for a single inserted row.

The data needed some smoothing, but the writes-per-one-row seem to grow logarithmically with the size of the table:
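The differentiation and smoothing can be sketched as follows. The sample numbers are invented, and a plain moving average stands in for whatever smoothing was applied to the real measurements.

```python
def writes_per_row(rows_inserted, bytes_written):
    """Difference quotient of cumulative write volume over cumulative row count."""
    result = []
    for i in range(1, len(rows_inserted)):
        delta_bytes = bytes_written[i] - bytes_written[i - 1]
        delta_rows = rows_inserted[i] - rows_inserted[i - 1]
        result.append(delta_bytes / delta_rows)
    return result


def moving_average(values, window):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]


# Invented cumulative samples: rows inserted so far and bytes written so far.
rows_inserted = [0, 1000, 2000, 3000, 4000]
bytes_written = [0, 20_000, 50_000, 90_000, 140_000]

per_row = writes_per_row(rows_inserted, bytes_written)
print(per_row)                     # [20.0, 30.0, 40.0, 50.0]
print(moving_average(per_row, 2))  # [25.0, 35.0, 45.0]
```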

The random primary key’s performance (red) seems to match the B-tree’s expected insert time complexity of O(log n). Still, we can do much better because pages can be kept in the cache during filling and written out only once (yellow).

MySQL (InnoDB)

The simpler MyISAM storage engine in MySQL is quite limited and doesn’t support transactions. In the current version, InnoDB is the default storage engine and sadly, each table is built around a clustered index.

The following config variables were used to keep the comparisons fair:

innodb_page_size=16k
innodb_buffer_pool_size=10M

As we can see, InnoDB performs similarly to SQLite’s clustered index.

PostgreSQL

In the case of PostgreSQL, clustered indices are not supported. This can be confusing because a CLUSTER command can be issued, but it is only a one-time operation and will not affect continuous writes.

The impact of random IDs should be much lower and should be similar to SQLite’s default non-clustered index.

As for the setup, PostgreSQL can only change its page size (block size) as a compilation parameter, so I kept the default value of 8KB. I set the page cache (shared_buffers) and the WAL buffers (wal_buffers) to 10MB each, which is 1280 blocks of 8KB, by passing the following arguments to the postgres executable:

-c shared_buffers=1280 -c wal_buffers=1280

Quite a surprise! It seems that Postgres handles random values better than sequential ones. On a larger scale, though, this relationship switches back to the expected one, where random values “lose:”

When taking the differentiated writes (per row) on a larger table, the overhead of random values is visible as well:

The difference is obviously less significant than on a clustered index, but it is still there.

If you want to use a random identifier or have an existing back-end application that generates random id-s, and you are free to choose a relational database, PostgreSQL will meet your needs.

Conclusion

When using a clustered index in a table that is not purely composed of index values, generating completely random values for the primary key across the full set of possible values will lead to many disk I/O operations because the page cache cannot be used effectively.

Note: A clustered index is still useful when scanning a large number of rows on a range of the primary key ², but this is unlikely to be needed with completely random identifiers.

How much I/O can be saved?

The number of saved I/O operations is not constant and it depends on the size of the index and therefore on the number of records written. The saved I/O can be estimated by taking the integral of the B-tree insert complexity for 1 insert:

\int \ln(n) \, dn = n (\ln(n) - 1) + \text{constant}

If we assume that the sequential id’s insert complexity is constant 1, the resulting ratio is as follows:

\text{saved I/O portion} \approx 1 - \frac{n}{n (\ln(n) - 1)} \approx 1 - \frac{1}{\ln(n) - 1}

When inserting one million rows into a clustered index, using a sequential primary key would save approximately 92% of I/O volume.
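The figure can be reproduced by plugging n = 1,000,000 into the ratio above:

```python
import math


def saved_io_portion(n: int) -> float:
    """Estimated share of I/O saved by sequential keys: 1 - 1 / (ln(n) - 1)."""
    return 1 - 1 / (math.log(n) - 1)


print(f"{saved_io_portion(1_000_000):.1%}")  # 92.2%
```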

UUIDs

Universally Unique Identifiers (UUIDs) have a couple of advantages over sequential id-s generated by the RDBMS (AUTOINCREMENT, SERIAL), mainly that they can be generated prior to sending the insert query to the DB. Unfortunately, InnoDB with its clustered index is one of the most popular storage engines. Using it with UUIDs, which are random by design, leads to excess disk I/O, which is non-trivial to avoid.
