Sadiul Hakim

Posted on Sep 26

MySQL Database Partitioning Tutorial

#mysql #partitioning #scaling

Database partitioning is a powerful technique for managing large tables in MySQL. Here is a comprehensive tutorial covering the key aspects of partitioning.

1. What is Partitioning?

Partitioning is a process that divides a large database table into smaller, more manageable pieces called partitions. Each partition is stored and handled as an individual table, but they are logically grouped together and treated as a single table by MySQL. This division is based on a set of rules (a partitioning function or scheme) that you define.

2. Why Use It and Why Not?

Why Use Partitioning? (Benefits)

Partitioning offers several significant benefits:

Benefit	Description
Improved Query Performance	Queries that involve the partitioning key can scan fewer partitions, dramatically reducing the amount of data the database has to read, a process called partition pruning.
Easier Data Management	You can quickly perform administrative tasks on parts of the data. For example, deleting old data is as fast as dropping an entire partition (a metadata operation), which is much faster than running a `DELETE` statement on millions of rows.
Increased Throughput	Parallel processing can be enabled, allowing multiple partitions to be processed simultaneously for some operations.
Data Locality	Related data is stored together, potentially improving cache usage and I/O performance.

Why Not Use Partitioning? (Drawbacks)

Drawback	Description
Increased Complexity	Designing, implementing, and maintaining a partitioned table is more complex than a normal table. Partitioning keys and functions must be carefully chosen.
No Cross-Partition Keys	All unique keys and primary keys on a partitioned table must include the partitioning columns.
Open File Limit	Each partition is stored as a separate file. A table with many partitions can hit the operating system's open file limit.
No Performance Gain	If queries do not filter on the partitioning key, they will have to scan all partitions, potentially making performance worse than a non-partitioned table due to the overhead.

3. When to Use It and When Not?

When to Use Partitioning

Very Large Tables (Millions/Billions of Rows): When tables exceed physical memory and I/O becomes a bottleneck.
Time-Series Data: When data is constantly growing based on time (e.g., log data, sensor readings), allowing old data to be easily purged by dropping date-based partitions.
Data Archival/Retention Policies: When different segments of data have different retention or storage requirements.
Table Scans are Slow: When many queries involve table scans that are slow and the query conditions align well with a partitioning strategy.

When Not to Use Partitioning

Small to Medium Tables: The overhead of partitioning outweighs any benefit.
Tables with Complex Joins: Partitioning does not directly improve join performance unless the join condition also aligns perfectly with the partitioning scheme (partition-wise joins).
Queries Do Not Use Partitioning Key: If most common queries do not filter on the column used for partitioning, partitioning provides no benefit and adds overhead.

4. Methods of Partitioning with Examples

MySQL supports four main types of partitioning, plus a composite method.

1. RANGE Partitioning (Best for Dates/Sequential Ranges)

Data is partitioned based on column values that fall within a given contiguous range. This is ideal for scenarios like time-series data or data with defined numerical ranges.

Syntax: PARTITION BY RANGE (expression)
Requirement: The expression must return an integer or NULL. Each partition definition uses VALUES LESS THAN (value). The values must be strictly increasing.

CREATE TABLE sales (
    sale_id INT NOT NULL,
    cust_id INT NOT NULL,
    store_id INT NOT NULL,
    sale_date DATE NOT NULL,
    amount DECIMAL (9,2)
)
PARTITION BY RANGE (YEARWEEK(sale_date))
(
    PARTITION s1 VALUES LESS THAN (202002), -- Sales up to the 2nd week of 2020
    PARTITION s2 VALUES LESS THAN (202003),
    PARTITION s3 VALUES LESS THAN (202004),
    PARTITION s4 VALUES LESS THAN (202005),
    PARTITION s5 VALUES LESS THAN (202006),
    PARTITION s999 VALUES LESS THAN (MAXVALUE) -- Catch-all for future data
);

-- Management Example: Reorganizing a partition
-- This example splits the s999 catch-all partition into new weeks and a new catch-all.
ALTER TABLE sales REORGANIZE PARTITION s999 INTO
(
    PARTITION s6 VALUES LESS THAN (202007),
    PARTITION s7 VALUES LESS THAN (202008),
    PARTITION s999 VALUES LESS THAN (MAXVALUE) -- New catch-all
);

2. LIST Partitioning (Best for Enumerated Columns)

Data is partitioned based on column values matching one of a discrete set of values specified for each partition.

Syntax: PARTITION BY LIST (column_name/expression)
Requirement: The column or expression must return an integer or a string. Each partition definition uses VALUES IN (list_of_values).

CREATE TABLE sales_list
(
    sale_id INT NOT NULL,
    cust_id INT NOT NULL,
    store_id INT NOT NULL,
    sale_date DATE NOT NULL,
    geo_region_cd VARCHAR(6) NOT NULL,
    amount DECIMAL(9,2)
)
PARTITION BY LIST COLUMNS (geo_region_cd) -- Use LIST COLUMNS for non-integer/string types
(
    PARTITION MIDDLEEAST VALUES IN ('SA','QT','SY'),
    PARTITION EUROPE VALUES IN ('EUR_I','EUR_F','EUR_B','EUR_S','EUR_P'),
    PARTITION ASIA VALUES IN ('PK','SG','IN','CH','RS')
    -- Note: Unlike RANGE, there is no MAXVALUE equivalent; all possible values must be listed
);

-- Management Example: Adding a new value to an existing partition
ALTER TABLE sales_list REORGANIZE PARTITION ASIA INTO
(
    PARTITION ASIA VALUES IN ('PK','SG','IN','CH','RS','NKOR')
);

3. HASH Partitioning (Best for Even Data Distribution)

Data is partitioned based on the result of a user-defined expression (usually a hash of a column value) and placed across a specified number of partitions. The goal is to distribute data evenly.

Syntax: PARTITION BY HASH (expression) PARTITIONS N
Requirement: The expression must return a non-negative integer.

CREATE TABLE sales_hash
(
    sale_id INT NOT NULL,
    cust_id INT NOT NULL,
    store_id INT NOT NULL,
    sale_date DATE NOT NULL,
    amount DECIMAL(9,2)
)
PARTITION BY HASH (cust_id) PARTITIONS 4;
-- Data rows are assigned to partitions H1, H2, H3, H4 based on the formula: MOD(HASH(cust_id), 4)

4. KEY Partitioning (Similar to HASH, but using a primary key)

Similar to HASH, but MySQL's internal hashing function is used, and the expression is usually based on one or more columns from the table's primary key or a unique key. It automatically uses all columns in the primary key if no columns are specified.

Syntax: PARTITION BY KEY (column_list) PARTITIONS N
Requirement: Columns specified in column_list must be part of a key.

5. COMPOSITE Partitioning (Subpartitioning)

This involves further subdividing each partition using a second partitioning scheme, typically HASH or KEY. It allows for finer data management control.

Example: Range-Hash composite partitioning. The main partitions are by Range (e.g., year), and each range partition is further broken down into Hash subpartitions (e.g., by customer ID).

CREATE TABLE sales_composite
(
    sale_id INT NOT NULL,
    cust_id INT NOT NULL,
    store_id INT NOT NULL,
    sale_date DATE NOT NULL,
    amount DECIMAL(9,2)
)
PARTITION BY RANGE (YEAR(sale_date))
SUBPARTITION BY HASH (cust_id)
(
    PARTITION s1 VALUES LESS THAN (2021)
    (
        SUBPARTITION s1_h1, -- Data from year < 2021, hashed to h1
        SUBPARTITION s1_h2,
        SUBPARTITION s1_h3,
        SUBPARTITION s1_h4
    ),
    -- ... (other range partitions)
    PARTITION s999 VALUES LESS THAN (MAXVALUE)
    (
        SUBPARTITION s999_h1,
        SUBPARTITION s999_h2,
        SUBPARTITION s999_h3,
        SUBPARTITION s999_h4
    )
);

5. When to Use Which Method?

Partitioning Method	Best Use Case	Query Type to Benefit
RANGE	Time-series data, historical data, data that is naturally segmented by sequential ranges (e.g., dates, numeric ranges).	Queries filtering by date or range, and data retention/archival.
LIST	Data with a small number of discrete, non-contiguous values (e.g., region codes, fixed statuses).	Queries filtering by specific discrete values.
HASH/KEY	Large tables where the primary goal is to distribute data evenly across partitions for better parallelization and reduced contention.	Queries that filter by the hash/key column(s), or bulk insert/update operations.
COMPOSITE	When you need the benefits of two methods, such as easily dropping old data (Range) and evenly distributing data within a range (Hash).	Queries that filter on the outer (Range) key for pruning, and operations that benefit from intra-partition balancing.

6. How Partitioning Improves Performance

Partitioning primarily improves performance through two mechanisms:

A. Partition Pruning

When a query is executed, the MySQL optimizer analyzes the WHERE clause relative to the partitioning function. If the condition allows, the optimizer can exclude irrelevant partitions from the scan, a process called partition pruning.

Example:
In the sales table partitioned by YEARWEEK(sale_date), a query like:
SELECT * FROM sales WHERE sale_date BETWEEN '2020-02-01' AND '2020-02-29';
The optimizer can calculate the range of week numbers (e.g., 202005 to 202009) and only scan partitions $s4, s5, s6, s7, s999$. It ignores partitions $s1, s2, s3$, dramatically reducing I/O and increasing query speed.

B. Efficient Data Management Operations

Operations that affect large portions of the table, especially data deletion and archiving, become highly efficient. Instead of running a lengthy DELETE query, you can use ALTER TABLE ... DROP PARTITION pX;, which is near-instantaneous as it is a metadata operation that simply removes the files associated with the partition.

DEV Community