De' Clerke

Posted on Feb 4 • Edited on Jun 2

⚡ High-Performance Warehousing: Partitioning & Clustering

#partitioning #clustering #datawarehouse #database

When a table crosses tens of millions of rows, a full scan on every query becomes your biggest performance problem. Partitioning and clustering are the two techniques that fix it, and understanding the difference between them is what separates a data warehouse that stays fast at scale from one that doesn't.

Partitioning: Divide and Conquer

Partitioning splits a large table into smaller physical segments called partitions, based on the values in a specific column. The database engine can then skip entire partitions that don't match a query's filter, a technique called partition pruning.

The most common partition key in time-series data is a timestamp. Here's a range-partitioned table in PostgreSQL:

CREATE TABLE stock_prices_partitioned (
    id          BIGSERIAL,
    symbol      VARCHAR(20)  NOT NULL,
    close_price NUMERIC(12, 4),
    volume      BIGINT,
    timestamp   TIMESTAMPTZ  NOT NULL
) PARTITION BY RANGE (timestamp);

CREATE TABLE stock_prices_2025_01
    PARTITION OF stock_prices_partitioned
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

CREATE TABLE stock_prices_2025_02
    PARTITION OF stock_prices_partitioned
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- Each partition gets its own index
CREATE INDEX ON stock_prices_2025_01 (symbol, timestamp DESC);
CREATE INDEX ON stock_prices_2025_02 (symbol, timestamp DESC);

PostgreSQL routes inserts to the correct partition automatically. A query filtering on January data touches only the January partition, not the full table.

Partition pruning in practice:

EXPLAIN ANALYZE
SELECT symbol, close_price
FROM stock_prices_partitioned
WHERE timestamp BETWEEN '2025-01-01' AND '2025-01-31';
-- Output will show: Partitions selected: 1 (of N)

Maintenance also becomes simpler. Archiving or dropping three years of old data is one statement:

-- Drops the entire partition instantly — no DELETE overhead, no locking the table
DROP TABLE stock_prices_2022_01;

Partitioning Strategies

Range partitioning is the most common in data pipelines. You divide rows by a continuous value range, typically a date or timestamp.

List partitioning divides rows by discrete values. Useful when you consistently filter by a known set of categories:

CREATE TABLE listings_partitioned (
    id       BIGSERIAL,
    location TEXT NOT NULL,
    price    NUMERIC(14, 2),
    status   VARCHAR(20)
) PARTITION BY LIST (location);

CREATE TABLE listings_nairobi
    PARTITION OF listings_partitioned
    FOR VALUES IN ('Westlands', 'Kilimani', 'Karen', 'Lavington');

CREATE TABLE listings_mombasa
    PARTITION OF listings_partitioned
    FOR VALUES IN ('Nyali', 'Bamburi', 'Tudor');

Hash partitioning distributes rows evenly across N partitions based on a hash of the partition key. Useful when you have no natural range or list but want to spread I/O across partitions.

Clustering: Physical Data Ordering

Partitioning controls which segment of data a query reads. Clustering controls how data is physically ordered within that segment.

When rows are stored in the order you typically query them, the database reads them in one sequential sweep instead of jumping around the disk. That's the performance benefit.

Clustering in PostgreSQL

PostgreSQL's CLUSTER command rewrites a table so its physical row order matches a specific index:

-- Reorder stock_prices rows by (symbol, timestamp) — matches common query patterns
CLUSTER stock_prices USING idx_stock_sym_ts;

After clustering, queries that filter on symbol and timestamp perform fewer heap fetches. The downside is that CLUSTER locks the table while it runs and the physical order degrades over time as new rows are inserted in heap order.

For append-heavy time-series tables, a BRIN index (Block Range Index) achieves similar query speedups without the maintenance cost. BRIN works because recently inserted rows are physically adjacent on disk:

-- BRIN is tiny (a few kilobytes) but fast for naturally ordered columns
CREATE INDEX idx_stock_ts_brin ON stock_prices USING BRIN (timestamp);

Clustering in Cloud Data Warehouses

In BigQuery and Snowflake, clustering is a table property you declare at creation time. The warehouse manages physical ordering automatically:

-- BigQuery: partition by date, cluster by symbol for fast per-symbol queries
CREATE TABLE nse_pipeline.stock_prices
PARTITION BY DATE(timestamp)
CLUSTER BY symbol, sector;

You don't manage it manually. The engine re-clusters as data grows.

Partitioning vs. Clustering

Feature	Partitioning	Clustering
What it does	Splits data into separate physical segments	Orders rows within a segment
Benefit	Skip entire partitions (data pruning)	Reduce random I/O within a scan
Best key	Date, region, category	Frequently filtered columns
Maintenance (PostgreSQL)	Add/drop partitions as needed	Re-run `CLUSTER` or use BRIN
Maintenance (Cloud DW)	Auto-managed	Auto-managed

Using Both Together

The most effective setup combines them. Partition by date to skip months of data, then cluster by the column you filter within a single day:

-- PostgreSQL: partition by month, index by symbol within each month
CREATE TABLE stock_prices_2025_01
    PARTITION OF stock_prices_partitioned
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

CREATE INDEX ON stock_prices_2025_01 (symbol, timestamp DESC);

A query for SCOM data in January 2025 hits one partition and uses one index. Without partitioning, it scans everything. Without the index, it scans the entire January partition row by row.

At scale, the combination is the difference between a query that finishes in milliseconds and one that runs for minutes.

Follow me on dev.to for more data engineering content, or browse the project code at github.com/declerke.

DEV Community