Saulo José Benvenutti

storage_engine: Two High-Performance Table Access Methods for PostgreSQL Analytics and HTAP Workloads

After working with PostgreSQL's Table Access Method (TAM) API introduced in version 12, I built storage_engine — a PostgreSQL extension that ships two specialized storage engines as first-class AMs: colcompress for column-oriented analytics and rowcompress for append-heavy compressed workloads. Both coexist with standard heap tables in the same database, and both are available on PGXN.

This article is aimed at DBAs and database architects evaluating columnar or compressed storage for analytical and HTAP workloads without leaving the PostgreSQL ecosystem.


Lineage and Honest Attribution

storage_engine is a fork of Hydra Columnar, which is itself derived from citus_columnar — originally built by Citus Data, now part of Microsoft. The original code is copyright Citus Data / Hydra, licensed under AGPL-3.0. I extend those foundations with rowcompress, full DELETE/UPDATE support, MergeTree-style ordering, two-level zone-map pruning, and a redesigned parallel scan.

Ideas borrowed from other systems are acknowledged explicitly:

  • ClickHouse MergeTree: per-table orderby sort key and stripe/chunk-level zone-map pruning
  • Apache Parquet: row-group statistics, column projection, dictionary encoding
  • DuckDB: vectorized expression evaluation for columnar batches

I mention this upfront because the PostgreSQL community values transparency, and because it helps you understand what is new versus what is inherited.


Why Two Access Methods in One Extension?

Most workloads are not purely analytical or purely transactional. A common HTAP pattern involves:

  • Wide fact tables updated in bulk and queried with selective column projections → colcompress
  • High-volume append-only logs, audit trails, or time-series where all columns are read together → rowcompress

Having both AMs in a single extension means a single CREATE EXTENSION storage_engine installs both, all catalog objects land in the engine schema, and C symbols carry the se_ prefix to avoid conflicts with citus_columnar or any other columnar extension you may already have loaded.

CREATE EXTENSION storage_engine;

-- Column-oriented analytics table
CREATE TABLE events (
    ts         timestamptz NOT NULL,
    user_id    bigint,
    event_type text,
    value      float8
) USING colcompress;

-- Row-oriented compressed log table
CREATE TABLE audit_log (
    id        bigserial,
    logged_at timestamptz NOT NULL,
    message   text
) USING rowcompress;

Both tables live alongside heap tables in the same database. Joins between them and heap tables work normally through the standard PostgreSQL executor.


colcompress: Column-Oriented Storage with Vectorized Execution

Storage Layout

Data is stored column by column. Rows are grouped into stripes (default 150,000 rows), each stripe is subdivided into chunk groups (default 10,000 rows), and within a chunk group every column's values are stored and compressed separately. Each chunk records its column's minimum and maximum value:

Table file
├── Stripe 1  (rows 1 – 150,000)
│   ├── Chunk group 0  (rows 1 – 10,000)
│   │   ├── Column A  [min, max, compressed values…]
│   │   ├── Column B  [min, max, compressed values…]
│   │   └── …
│   └── Chunk group 1  (rows 10,001 – 20,000)  …
└── Stripe 2  (rows 150,001 – 300,000)  …

A scan reads only the columns referenced by the query. On a 30-column table where a query touches 4 columns, roughly 87% of the column data is never read. This is the fundamental advantage of columnar storage for analytics.
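
The layout arithmetic is easy to sketch. Here is a small Python model (names are mine, not the extension's internals) that maps a row number to its stripe and chunk group, and computes the projection savings just mentioned:

```python
STRIPE_ROWS = 150_000  # default rows per stripe
CHUNK_ROWS = 10_000    # default rows per chunk group

def locate(row_number: int) -> tuple[int, int]:
    """Map a 1-based row number to (stripe, chunk group) per the diagram:
    stripes are numbered from 1, chunk groups from 0 within each stripe."""
    zero_based = row_number - 1
    stripe = zero_based // STRIPE_ROWS + 1
    chunk_group = (zero_based % STRIPE_ROWS) // CHUNK_ROWS
    return stripe, chunk_group

def projection_savings(total_columns: int, read_columns: int) -> float:
    """Fraction of column data a projecting scan never reads."""
    return 1 - read_columns / total_columns

print(locate(150_000))   # → (1, 14): last row of stripe 1
print(locate(150_001))   # → (2, 0): first row of stripe 2
print(round(projection_savings(30, 4), 3))  # → 0.867 — the "roughly 87%" above
```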

Compression Algorithms

Each chunk group is compressed independently. Available algorithms:

Name Description
none No compression
lz4 Fast compression/decompression, moderate ratio
zstd High ratio, configurable level 1–19 (default: 3)
pglz PostgreSQL's built-in LZ variant

SELECT engine.alter_colcompress_table_set('events'::regclass,
    compression       => 'zstd',
    compression_level => 9);
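
zstd is not in the Python standard library, but zlib illustrates the same level-versus-ratio tradeoff the table describes — a sketch, not a measurement of the extension itself:

```python
import zlib

# Repetitive, log-like payload — typical of what a chunk group holds
payload = b"2024-01-01 user=42 event=click value=1.0\n" * 10_000

for level in (1, 6, 9):  # zlib levels 1-9; zstd's 1-19 scale trades speed for ratio the same way
    blob = zlib.compress(payload, level)
    print(f"level {level}: {len(payload) / len(blob):.0f}x")
```

Higher levels buy a better ratio at higher CPU cost per chunk group; the extension defaults to zstd level 3 (see the table above).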

Two-Level Zone-Map Pruning

This is one of the most impactful features for DBAs managing large time-series or event tables.

Stripe-level pruning (coarse) — Before reading any data, the scan aggregates min/max across all chunks of each stripe and tests the resulting stripe-wide range against the query's WHERE predicates using PostgreSQL's predicate_refuted_by. Stripes provably disjoint from the predicate are skipped entirely — no decompression, no I/O. EXPLAIN reports this directly:

Custom Scan (ColumnarScan) on events
  Engine Stripes Removed by Pruning: 41
  Engine Stripes Read: 12

Chunk-level pruning (fine) — Within each stripe that survives the coarse pass, individual chunk groups are tested against the same predicate. Chunk groups whose min/max range cannot satisfy the predicate are skipped.

The two layers compose. On a large, well-sorted table, a date-range query eliminates entire stripes before touching them, then further prunes chunk groups within the survivors. The practical result is selectivity comparable to an index scan, without the cost of maintaining a B-tree.
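
A Python sketch of the two-level test for a range predicate lo <= v <= hi (hypothetical structures — internally the extension relies on predicate_refuted_by):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    vmin: int
    vmax: int

@dataclass
class Stripe:
    chunks: list

def overlaps(vmin, vmax, lo, hi):
    """A [vmin, vmax] zone can satisfy lo <= v <= hi only if the ranges intersect."""
    return vmax >= lo and vmin <= hi

def prune(stripes, lo, hi):
    survivors = []
    for stripe in stripes:
        smin = min(c.vmin for c in stripe.chunks)   # stripe-level (coarse) zone map
        smax = max(c.vmax for c in stripe.chunks)
        if not overlaps(smin, smax, lo, hi):
            continue                                 # whole stripe skipped: no I/O at all
        kept = [c for c in stripe.chunks if overlaps(c.vmin, c.vmax, lo, hi)]  # fine pass
        if kept:
            survivors.append(kept)
    return survivors

# Well-sorted data: stripe 1 holds values 1-30, stripe 2 holds 31-60
stripes = [
    Stripe([Chunk(1, 10), Chunk(11, 20), Chunk(21, 30)]),
    Stripe([Chunk(31, 40), Chunk(41, 50), Chunk(51, 60)]),
]
kept = prune(stripes, 45, 55)
print(len(kept), sum(len(c) for c in kept))  # → 1 2: one stripe, two chunk groups survive
```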

Pruning effectiveness scales with data sortedness. This brings us to the MergeTree-inspired ordering mechanism.

MergeTree-Like Ordering

-- Define a global sort key
SELECT engine.alter_colcompress_table_set(
    'events'::regclass,
    orderby => 'ts ASC, user_id ASC'
);

-- After a bulk load, compact and globally sort
SELECT engine.colcompress_merge('events');

colcompress_merge copies all live rows to a temporary heap table, truncates the target, and re-inserts rows in the defined order — writing fresh, globally ordered stripes. After this, WHERE ts BETWEEN x AND y skips almost all chunks on typical time-series data.

DBA note: colcompress_merge acquires AccessExclusiveLock for the duration of the operation. Schedule it during a maintenance window for large tables. There is no online/concurrent mode.
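
Why the merge matters can be shown numerically: on unsorted data every chunk's min/max spans nearly the whole domain, so zone maps refute nothing. A sketch with synthetic timestamps:

```python
import random

def chunk_zone_maps(values, chunk_rows=10_000):
    """Per-chunk (min, max) zone maps in insertion order."""
    return [(min(values[i:i + chunk_rows]), max(values[i:i + chunk_rows]))
            for i in range(0, len(values), chunk_rows)]

def chunks_read(zone_maps, lo, hi):
    """Chunks whose [min, max] range overlaps the predicate lo <= v <= hi."""
    return sum(1 for vmin, vmax in zone_maps if vmax >= lo and vmin <= hi)

random.seed(0)
ts = list(range(100_000))       # 100k synthetic timestamps
random.shuffle(ts)              # insertion order before any merge

before = chunk_zone_maps(ts)           # unsorted: every zone map spans ~the whole domain
after = chunk_zone_maps(sorted(ts))    # post-merge: zone maps are narrow and disjoint

# WHERE ts BETWEEN 40000 AND 41000
print(chunks_read(before, 40_000, 41_000))  # → 10: nothing can be pruned
print(chunks_read(after, 40_000, 41_000))   # → 1: nine of ten chunks skipped
```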

Vectorized Execution

colcompress ships a vectorized expression evaluation engine that processes WHERE clauses and aggregates in column-oriented batches of up to 10,000 values per call. This eliminates per-row interpreter overhead and maps naturally onto column chunks.

Supported vectorized operations:

  • Comparison operators (=, <>, <, <=, >, >=): int2, int4, int8, float4, float8, date, timestamp, timestamptz, char, bpchar, text, varchar, name, bool, oid
  • Aggregates (count, sum, avg, max, min): int2, int4, int8, float8, date, timestamptz

Vectorized execution and the column cache are on by default and can be toggled per session:

SET storage_engine.enable_vectorization = on;
SET storage_engine.enable_column_cache   = on;

Parallel Scan

The AM implements the full PostgreSQL parallel Table AM protocol via Dynamic Shared Memory (DSM). The coordinator divides the stripe range across workers; each worker independently reads and decompresses its assigned stripes. Parallel scan stacks on top of vectorized execution — each worker runs its own vectorized pipeline.

SET storage_engine.enable_parallel_execution = on;
SET storage_engine.min_parallel_processes    = 8;

-- Standard PostgreSQL parallel knobs also apply
SET max_parallel_workers_per_gather = 4;
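
The coordinator/worker split can be sketched with plain Python threads (illustrative only — the real implementation uses DSM and the Table AM parallel-scan callbacks):

```python
import threading
from queue import Queue, Empty

def parallel_scan(stripe_ids, workers=4):
    """Coordinator puts stripe IDs on a queue; workers claim them one at a time."""
    todo = Queue()
    for sid in stripe_ids:          # in the extension, only stripes that survive pruning land here
        todo.put(sid)
    scanned, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                sid = todo.get_nowait()  # atomic claim, analogous to rowcompress batch claiming
            except Empty:
                return
            with lock:
                scanned.append(sid)      # stand-in for "read and decompress this stripe"

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return scanned

print(sorted(parallel_scan(range(1, 8))))  # → [1, 2, 3, 4, 5, 6, 7]: each stripe scanned exactly once
```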

Parallel scan and stripe pruning (v1.0.6+): Stripe pruning is active in both the sequential and parallel scan paths. The coordinator pre-filters stripe IDs against the query predicate before distributing work to workers, so each worker receives only stripes that survive the min/max test — no wasted I/O in parallel mode. This is confirmed by the benchmarks: Q5 (date-range, 1 month) achieves 22.4ms in serial and 28.2ms in parallel — both with 6 of 7 stripes eliminated by pruning.

DELETE, UPDATE, and Upserts

colcompress implements DELETE and UPDATE via a row mask stored in engine.row_mask. Each deleted row is recorded as a bit in a per-chunk-group bitmask; the scan engine skips masked rows without rewriting the stripe. UPDATE is delete-then-insert. Deleted rows are reclaimed during VACUUM, which rewrites affected stripes and clears the mask.

SET storage_engine.enable_dml = on;  -- default: on

DELETE FROM events WHERE ts < now() - interval '1 year';
UPDATE events SET value = value * 1.1 WHERE event_type = 'purchase';

-- Standard upserts also work, with a unique index on the conflict target
INSERT INTO events (ts, user_id, event_type, value)
VALUES (now(), 42, 'click', 1.0)
ON CONFLICT (user_id, event_type) DO UPDATE SET value = EXCLUDED.value;
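
The row-mask mechanism is simple to model — a sketch with hypothetical names (the real mask lives in engine.row_mask as catalog data):

```python
class ChunkGroup:
    def __init__(self, rows):
        self.rows = rows
        self.mask = bytearray((len(rows) + 7) // 8)  # one bit per row, like engine.row_mask

    def delete_where(self, pred):
        """DELETE only marks bits; the compressed stripe is never rewritten."""
        for i, row in enumerate(self.rows):
            if pred(row):
                self.mask[i // 8] |= 1 << (i % 8)

    def scan(self):
        """The scan engine skips masked rows."""
        for i, row in enumerate(self.rows):
            if not (self.mask[i // 8] >> (i % 8)) & 1:
                yield row

chunk = ChunkGroup([{"user_id": u, "value": u * 10} for u in range(5)])
chunk.delete_where(lambda r: r["user_id"] < 2)
print([r["user_id"] for r in chunk.scan()])  # → [2, 3, 4]
```

VACUUM then corresponds to rewriting the stripe without the masked rows and clearing the mask.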

Index-Backed Scan

For document repositories — tables storing XML, PDF, JSON blobs that need columnar compression but are fetched by primary key — an index scan path is available:

-- Enable per table (persisted across reconnects)
SELECT engine.alter_colcompress_table_set('documents'::regclass, index_scan => true);

-- Or per session for all colcompress tables
SET storage_engine.enable_columnar_index_scan = on;

The combination of zstd compression (3–10× ratio on large binary/text documents) and index-driven point lookups is compelling for document storage use cases.

Do not create B-tree indexes on columns covered by the orderby key on analytical tables. The PostgreSQL planner may prefer IndexScan for range queries, which bypasses stripe pruning entirely (randomAccess=true). Use GIN indexes for JSONB and array columns, and rely on stripe pruning for range predicates.


rowcompress: Batch-Compressed Row Storage

rowcompress stores rows in fixed-size batches (default 10,000 rows per batch). Each batch is serialized using heap tuple format and compressed as a single unit. Batch metadata — file offset, byte size, first row number, row count — is stored in engine.row_batch.

This AM suits append-heavy workloads where compression matters but column projection is not needed: event logs, audit trails, time-series tables where many or all columns are queried together. Typical storage savings are 2–10× with zstd.
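
The batch layout can be sketched in a few lines of Python (zlib and JSON stand in for zstd and PostgreSQL's heap tuple format; the metadata fields mirror what engine.row_batch records):

```python
import io
import json
import zlib

BATCH_ROWS = 10_000  # default batch size

def write_batches(rows, out: io.BytesIO):
    """Compress rows in fixed-size batches; return per-batch metadata."""
    meta = []
    for first in range(0, len(rows), BATCH_ROWS):
        batch = rows[first:first + BATCH_ROWS]
        blob = zlib.compress(json.dumps(batch).encode())
        meta.append({"offset": out.tell(), "size": len(blob),
                     "first_row": first + 1, "row_count": len(batch)})
        out.write(blob)
    return meta

def read_batch(buf: io.BytesIO, m):
    """A batch is the unit of decompression: seek, read, inflate as one piece."""
    buf.seek(m["offset"])
    return json.loads(zlib.decompress(buf.read(m["size"])))

rows = [[i, f"msg {i % 7}"] for i in range(25_000)]
buf = io.BytesIO()
meta = write_batches(rows, buf)
print(len(meta), meta[1]["first_row"], meta[2]["row_count"])  # → 3 10001 5000
```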

Compared to colcompress:

  • Reads full rows (no column projection)
  • Lower write latency per row (no columnar transposition at write time)
  • No vectorized execution or chunk-level pruning
  • Parallel reads via atomic batch claiming (workers self-schedule, zero coordinator overhead)
  • Full compression algorithm support

SELECT engine.alter_rowcompress_table_set(
    'audit_log'::regclass,
    batch_size        => 10000,
    compression       => 'zstd',
    compression_level => 5
);

-- Rewrite all batches after changing compression options
SELECT engine.rowcompress_repack('audit_log');

Benchmarks

Benchmark suite: 1,000,000 rows, PostgreSQL 18.3, AMD Ryzen 7 5800H (8-core), 40 GB RAM, shared_buffers=10GB. colcompress configured with lz4 compression and orderby = 'event_date ASC' (globally sorted via colcompress_merge). Results are the median of 3 runs.

Serial (JIT=off, max_parallel_workers_per_gather=0)


Query heap colcompress rowcompress citus_columnar
Q1 count(*) 39.8ms 43.0ms 313ms 36.6ms
Q2 SUM/AVG numeric + double 188.6ms 117.4ms 358ms 122.9ms
Q3 GROUP BY country (10 vals) 219.0ms 162.0ms 395ms 139.4ms
Q4 GROUP BY event_type + p95 538.9ms 448.4ms 685ms 469.7ms
Q5 date range 1 month 20.8ms 22.4ms 59.1ms 20.6ms
Q6 JSONB @> GIN 123.1ms 162.2ms 326ms 238.1ms
Q7 JSONB key + GROUP BY 388.5ms 310.3ms 550ms 358.2ms
Q8 array @> GIN 63.0ms 122.7ms 274ms 140.9ms
Q9 LIKE text scan 150.9ms 90.9ms 338ms 89.9ms
Q10 heavy multi-agg 1953ms 1939ms 2109ms 1925ms

Q5 on colcompress achieves heap-equivalent performance (22.4ms vs. 20.8ms) because stripe pruning eliminates 6 of 7 stripes — data is physically sorted by event_date via orderby. lz4 decompression adds negligible overhead over the pruned data.

Q6 and Q8 (GIN index on JSONB/arrays) benefit from the GIN index without needing stripe pruning. This is expected behavior.

Parallel (JIT=on, max_parallel_workers_per_gather=16)


Query heap colcompress rowcompress citus_columnar
Q1 count(*) 18.3ms 16.4ms 148ms 37.9ms
Q2 SUM/AVG numeric + double 53.5ms 29.7ms 166ms 121.5ms
Q3 GROUP BY country 61.6ms 166ms 161ms 143ms
Q4 GROUP BY event_type + p95 540ms 316ms 674ms 470ms
Q5 date range 1 month 21.4ms 28.2ms 73.3ms 21.1ms
Q6 JSONB @> GIN 84.3ms 40.4ms 490ms 245ms
Q7 JSONB key + GROUP BY 392ms 65.7ms 687ms 362ms
Q8 array @> GIN 61.6ms 32.7ms 273ms 146ms
Q9 LIKE text scan 48.7ms 25.4ms 157ms 91.7ms
Q10 heavy multi-agg 1903ms 641ms 2085ms 1920ms

Q5 on colcompress (28.2ms) remains close to the serial result (22.4ms) and heap (21.4ms). Since v1.0.6, the planner correctly applies disable_cost to parallel index paths, ensuring Parallel Custom Scan (ColcompressScan) with stripe pruning is chosen over Parallel Index Scan when index_scan=false.

The full benchmark kit is in tests/bench/. See BENCHMARKS.md for environment details and reproduction steps.


Known Limitations for DBAs

These are the limitations most relevant to production deployment decisions:

No AFTER ROW triggers or foreign keys. This is an architectural constraint of columnar storage. It also means pg_repack cannot be used — it relies on AFTER ROW triggers internally. Use engine.colcompress_repack() as a drop-in replacement.

No VACUUM FULL / table rewrite. Use engine.colcompress_repack() / engine.rowcompress_repack() instead.

No CLUSTER support. Use engine.colcompress_merge() with an orderby option to achieve equivalent physical ordering.

No unlogged tables. CREATE UNLOGGED TABLE ... USING colcompress is not supported.

Sort-on-write is disabled when B-tree indexes exist. Run engine.colcompress_merge() after bulk loads to re-establish global sort order.

colcompress_repack is not online. It acquires AccessExclusiveLock for the full duration. Schedule during maintenance windows.

AFTER STATEMENT triggers are supported. Only row-level (FOR EACH ROW) AFTER triggers are blocked.


Management Reference

Function Description
engine.alter_colcompress_table_set(regclass, ...) Set options on a colcompress table
engine.alter_colcompress_table_reset(regclass, ...) Reset colcompress options to system defaults
engine.colcompress_merge(regclass) Rewrite and globally sort a colcompress table by its orderby key
engine.colcompress_repack(regclass) Alias for colcompress_merge; drop-in replacement for pg_repack
engine.alter_rowcompress_table_set(regclass, ...) Set options on a rowcompress table
engine.alter_rowcompress_table_reset(regclass, ...) Reset rowcompress options to system defaults
engine.rowcompress_repack(regclass) Rewrite all batches with current options

Catalog views for inspection:

View Description
engine.colcompress_options Per-table options for all colcompress tables
engine.colcompress_stripes Stripe-level metadata (offset, size, row range) per table
engine.rowcompress_options Per-table options for all rowcompress tables
engine.rowcompress_batches Batch-level metadata for all rowcompress tables

All views grant SELECT to PUBLIC.


Installation

Build from source

Requires PostgreSQL server headers and pg_config in PATH. Supports PostgreSQL 13–18.

cd dist/
sudo make -j$(nproc) install

Add to postgresql.conf:

shared_preload_libraries = 'storage_engine'

If citus or pg_cron are also in shared_preload_libraries, load order matters — citus must appear before storage_engine:

shared_preload_libraries = 'pg_cron,citus,storage_engine'

Then:

CREATE EXTENSION storage_engine;

License

AGPL-3.0. This is relevant for organizations that distribute modified versions or embed the extension in SaaS products — review the license terms accordingly.


Links

Feedback, issues, and pull requests are welcome on GitHub.


Saulo José Benvenutti — Data Architect / PostgreSQL DBA
📧 saulojb@gmail.com · 🔗 github.com/saulojb
