When a table crosses tens of millions of rows, a full scan on every query becomes your biggest performance problem. Partitioning and clustering are the two techniques that fix it, and understanding the difference between them is what separates a data warehouse that stays fast at scale from one that doesn't.
Partitioning: Divide and Conquer
Partitioning splits a large table into smaller physical segments called partitions, based on the values in a specific column. The database engine can then skip entire partitions that don't match a query's filter, a technique called partition pruning.
The most common partition key in time-series data is a timestamp. Here's a range-partitioned table in PostgreSQL:
CREATE TABLE stock_prices_partitioned (
id BIGSERIAL,
symbol VARCHAR(20) NOT NULL,
close_price NUMERIC(12, 4),
volume BIGINT,
timestamp TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (timestamp);
CREATE TABLE stock_prices_2025_01
PARTITION OF stock_prices_partitioned
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE stock_prices_2025_02
PARTITION OF stock_prices_partitioned
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- Each partition gets its own index
CREATE INDEX ON stock_prices_2025_01 (symbol, timestamp DESC);
CREATE INDEX ON stock_prices_2025_02 (symbol, timestamp DESC);
PostgreSQL routes inserts to the correct partition automatically. A query filtering on January data touches only the January partition, not the full table.
Partition pruning in practice:
EXPLAIN ANALYZE
SELECT symbol, close_price
FROM stock_prices_partitioned
WHERE timestamp BETWEEN '2025-01-01' AND '2025-01-31';
-- Output will show: Partitions selected: 1 (of N)
Maintenance also becomes simpler. Archiving or dropping three years of old data is one statement:
-- Drops the entire partition instantly — no DELETE overhead, no locking the table
DROP TABLE stock_prices_2022_01;
Partitioning Strategies
Range partitioning is the most common in data pipelines. You divide rows by a continuous value range, typically a date or timestamp.
List partitioning divides rows by discrete values. Useful when you consistently filter by a known set of categories:
CREATE TABLE listings_partitioned (
id BIGSERIAL,
location TEXT NOT NULL,
price NUMERIC(14, 2),
status VARCHAR(20)
) PARTITION BY LIST (location);
CREATE TABLE listings_nairobi
PARTITION OF listings_partitioned
FOR VALUES IN ('Westlands', 'Kilimani', 'Karen', 'Lavington');
CREATE TABLE listings_mombasa
PARTITION OF listings_partitioned
FOR VALUES IN ('Nyali', 'Bamburi', 'Tudor');
Hash partitioning distributes rows evenly across N partitions based on a hash of the partition key. Useful when you have no natural range or list but want to spread I/O across partitions.
Clustering: Physical Data Ordering
Partitioning controls which segment of data a query reads. Clustering controls how data is physically ordered within that segment.
When rows are stored in the order you typically query them, the database reads them in one sequential sweep instead of jumping around the disk. That's the performance benefit.
Clustering in PostgreSQL
PostgreSQL's CLUSTER command rewrites a table so its physical row order matches a specific index:
-- Reorder stock_prices rows by (symbol, timestamp) — matches common query patterns
CLUSTER stock_prices USING idx_stock_sym_ts;
After clustering, queries that filter on symbol and timestamp perform fewer heap fetches. The downside is that CLUSTER locks the table while it runs and the physical order degrades over time as new rows are inserted in heap order.
For append-heavy time-series tables, a BRIN index (Block Range Index) achieves similar query speedups without the maintenance cost. BRIN works because recently inserted rows are physically adjacent on disk:
-- BRIN is tiny (a few kilobytes) but fast for naturally ordered columns
CREATE INDEX idx_stock_ts_brin ON stock_prices USING BRIN (timestamp);
Clustering in Cloud Data Warehouses
In BigQuery and Snowflake, clustering is a table property you declare at creation time. The warehouse manages physical ordering automatically:
-- BigQuery: partition by date, cluster by symbol for fast per-symbol queries
CREATE TABLE nse_pipeline.stock_prices
PARTITION BY DATE(timestamp)
CLUSTER BY symbol, sector;
You don't manage it manually. The engine re-clusters as data grows.
Partitioning vs. Clustering
| Feature | Partitioning | Clustering |
|---|---|---|
| What it does | Splits data into separate physical segments | Orders rows within a segment |
| Benefit | Skip entire partitions (data pruning) | Reduce random I/O within a scan |
| Best key | Date, region, category | Frequently filtered columns |
| Maintenance (PostgreSQL) | Add/drop partitions as needed | Re-run CLUSTER or use BRIN |
| Maintenance (Cloud DW) | Auto-managed | Auto-managed |
Using Both Together
The most effective setup combines them. Partition by date to skip months of data, then cluster by the column you filter within a single day:
-- PostgreSQL: partition by month, index by symbol within each month
CREATE TABLE stock_prices_2025_01
PARTITION OF stock_prices_partitioned
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE INDEX ON stock_prices_2025_01 (symbol, timestamp DESC);
A query for SCOM data in January 2025 hits one partition and uses one index. Without partitioning, it scans everything. Without the index, it scans the entire January partition row by row.
At scale, the combination is the difference between a query that finishes in milliseconds and one that runs for minutes.
Follow me on dev.to for more data engineering content, or browse the project code at github.com/declerke.
Top comments (0)