Saulo José Benvenutti

storage_engine: Two High-Performance Table Access Methods for PostgreSQL Analytics and HTAP Workloads

After working with PostgreSQL's Table Access Method (TAM) API introduced in version 12, I built storage_engine — a PostgreSQL extension that ships two specialized storage engines as first-class AMs: colcompress for column-oriented analytics and rowcompress for append-heavy compressed workloads. Both coexist with standard heap tables in the same database, and both are available on PGXN.

This article is aimed at DBAs and database architects evaluating columnar or compressed storage for analytical and HTAP workloads without leaving the PostgreSQL ecosystem.


Lineage and Honest Attribution

storage_engine is a fork of Hydra Columnar, which is itself derived from citus_columnar — originally built by Citus Data, now part of Microsoft. The original code is copyright Citus Data / Hydra, licensed under AGPL-3.0. I extend those foundations with rowcompress, full DELETE/UPDATE support, MergeTree-style ordering, two-level zone-map pruning, and a redesigned parallel scan.

Ideas borrowed from other systems are acknowledged explicitly:

  • ClickHouse MergeTree: per-table orderby sort key and stripe/chunk-level zone-map pruning
  • Apache Parquet: row-group statistics, column projection, dictionary encoding
  • DuckDB: vectorized expression evaluation for columnar batches

I mention this upfront because the PostgreSQL community values transparency, and because it helps you understand what is new versus what is inherited.


Why Two Access Methods in One Extension?

Most workloads are not purely analytical or purely transactional. A common HTAP pattern involves:

  • Wide fact tables updated in bulk and queried with selective column projections → colcompress
  • High-volume append-only logs, audit trails, or time-series where all columns are read together → rowcompress

Having both AMs in a single extension means a single CREATE EXTENSION storage_engine installs both, all catalog objects land in the engine schema, and C symbols carry the se_ prefix to avoid conflicts with citus_columnar or any other columnar extension you may already have loaded.

CREATE EXTENSION storage_engine;

-- Column-oriented analytics table
CREATE TABLE events (
    ts         timestamptz NOT NULL,
    user_id    bigint,
    event_type text,
    value      float8
) USING colcompress;

-- Row-oriented compressed log table
CREATE TABLE audit_log (
    id        bigserial,
    logged_at timestamptz NOT NULL,
    message   text
) USING rowcompress;

Both tables live alongside heap tables in the same database. Joins between them and heap tables work normally through the standard PostgreSQL executor.


colcompress: Column-Oriented Storage with Vectorized Execution

Storage Layout

Data is stored column by column. Rows are grouped into stripes (default 150,000 rows), each stripe is subdivided into chunk groups (default 10,000 rows), and within a chunk group every column's values are stored and compressed separately. Each chunk records its column's minimum and maximum value:

Table file
├── Stripe 1  (rows 1 – 150,000)
│   ├── Chunk group 0  (rows 1 – 10,000)
│   │   ├── Column A  [min, max, compressed values…]
│   │   ├── Column B  [min, max, compressed values…]
│   │   └── …
│   └── Chunk group 1  (rows 10,001 – 20,000)  …
└── Stripe 2  (rows 150,001 – 300,000)  …

A scan reads only the columns referenced by the query. On a 30-column table where a query touches 4 columns, roughly 87% of the column data is never read. This is the fundamental advantage of columnar storage for analytics.
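
The layout arithmetic is easy to sketch. Here is a small Python model (names are mine, not the extension's internals) that maps a row number to its stripe and chunk group, and computes the projection savings just mentioned:

```python
STRIPE_ROWS = 150_000  # default rows per stripe
CHUNK_ROWS = 10_000    # default rows per chunk group

def locate(row_number: int) -> tuple[int, int]:
    """Map a 1-based row number to (stripe, chunk group) per the diagram:
    stripes are numbered from 1, chunk groups from 0 within each stripe."""
    zero_based = row_number - 1
    stripe = zero_based // STRIPE_ROWS + 1
    chunk_group = (zero_based % STRIPE_ROWS) // CHUNK_ROWS
    return stripe, chunk_group

def projection_savings(total_columns: int, read_columns: int) -> float:
    """Fraction of column data a projecting scan never reads."""
    return 1 - read_columns / total_columns

print(locate(150_000))   # → (1, 14): last row of stripe 1
print(locate(150_001))   # → (2, 0): first row of stripe 2
print(round(projection_savings(30, 4), 3))  # → 0.867 — the "roughly 87%" above
```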

Compression Algorithms

Each chunk group is compressed independently. Available algorithms:

Name Description
none No compression
lz4 Fast compression/decompression, moderate ratio
zstd High ratio, configurable level 1–19 (default: 3)
pglz PostgreSQL's built-in LZ variant

SELECT engine.alter_colcompress_table_set('events'::regclass,
    compression       => 'zstd',
    compression_level => 9);
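
zstd is not in the Python standard library, but zlib illustrates the same level-versus-ratio tradeoff the table describes — a sketch, not a measurement of the extension itself:

```python
import zlib

# Repetitive, log-like payload — typical of what a chunk group holds
payload = b"2024-01-01 user=42 event=click value=1.0\n" * 10_000

for level in (1, 6, 9):  # zlib levels 1-9; zstd's 1-19 scale trades speed for ratio the same way
    blob = zlib.compress(payload, level)
    print(f"level {level}: {len(payload) / len(blob):.0f}x")
```

Higher levels buy a better ratio at higher CPU cost per chunk group; the extension defaults to zstd level 3 (see the table above).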

Two-Level Zone-Map Pruning

This is one of the most impactful features for DBAs managing large time-series or event tables.

Stripe-level pruning (coarse) — Before reading any data, the scan aggregates min/max across all chunks of each stripe and tests the resulting stripe-wide range against the query's WHERE predicates using PostgreSQL's predicate_refuted_by. Stripes provably disjoint from the predicate are skipped entirely — no decompression, no I/O. EXPLAIN reports this directly:

Custom Scan (ColumnarScan) on events
  Engine Stripes Removed by Pruning: 41
  Engine Stripes Read: 12

Chunk-level pruning (fine) — Within each stripe that survives the coarse pass, individual chunk groups are tested against the same predicate. Chunk groups whose min/max range cannot satisfy the predicate are skipped.

The two layers compose. On a large, well-sorted table, a date-range query eliminates entire stripes before touching them, then further prunes chunk groups within the survivors. The practical result is selectivity comparable to an index scan, without the cost of maintaining a B-tree.
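
A Python sketch of the two-level test for a range predicate lo <= v <= hi (hypothetical structures — internally the extension relies on predicate_refuted_by):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    vmin: int
    vmax: int

@dataclass
class Stripe:
    chunks: list

def overlaps(vmin, vmax, lo, hi):
    """A [vmin, vmax] zone can satisfy lo <= v <= hi only if the ranges intersect."""
    return vmax >= lo and vmin <= hi

def prune(stripes, lo, hi):
    survivors = []
    for stripe in stripes:
        smin = min(c.vmin for c in stripe.chunks)   # stripe-level (coarse) zone map
        smax = max(c.vmax for c in stripe.chunks)
        if not overlaps(smin, smax, lo, hi):
            continue                                 # whole stripe skipped: no I/O at all
        kept = [c for c in stripe.chunks if overlaps(c.vmin, c.vmax, lo, hi)]  # fine pass
        if kept:
            survivors.append(kept)
    return survivors

# Well-sorted data: stripe 1 holds values 1-30, stripe 2 holds 31-60
stripes = [
    Stripe([Chunk(1, 10), Chunk(11, 20), Chunk(21, 30)]),
    Stripe([Chunk(31, 40), Chunk(41, 50), Chunk(51, 60)]),
]
kept = prune(stripes, 45, 55)
print(len(kept), sum(len(c) for c in kept))  # → 1 2: one stripe, two chunk groups survive
```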

Pruning effectiveness scales with data sortedness. This brings us to the MergeTree-inspired ordering mechanism.

MergeTree-Like Ordering

-- Define a global sort key
SELECT engine.alter_colcompress_table_set(
    'events'::regclass,
    orderby => 'ts ASC, user_id ASC'
);

-- After a bulk load, compact and globally sort
SELECT engine.colcompress_merge('events');

colcompress_merge copies all live rows to a temporary heap table, truncates the target, and re-inserts rows in the defined order — writing fresh, globally ordered stripes. After this, WHERE ts BETWEEN x AND y skips almost all chunks on typical time-series data.

DBA note: colcompress_merge acquires AccessExclusiveLock for the duration of the operation. Schedule it during a maintenance window for large tables. There is no online/concurrent mode.
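
Why the merge matters can be shown numerically: on unsorted data every chunk's min/max spans nearly the whole domain, so zone maps refute nothing. A sketch with synthetic timestamps:

```python
import random

def chunk_zone_maps(values, chunk_rows=10_000):
    """Per-chunk (min, max) zone maps in insertion order."""
    return [(min(values[i:i + chunk_rows]), max(values[i:i + chunk_rows]))
            for i in range(0, len(values), chunk_rows)]

def chunks_read(zone_maps, lo, hi):
    """Chunks whose [min, max] range overlaps the predicate lo <= v <= hi."""
    return sum(1 for vmin, vmax in zone_maps if vmax >= lo and vmin <= hi)

random.seed(0)
ts = list(range(100_000))       # 100k synthetic timestamps
random.shuffle(ts)              # insertion order before any merge

before = chunk_zone_maps(ts)           # unsorted: every zone map spans ~the whole domain
after = chunk_zone_maps(sorted(ts))    # post-merge: zone maps are narrow and disjoint

# WHERE ts BETWEEN 40000 AND 41000
print(chunks_read(before, 40_000, 41_000))  # → 10: nothing can be pruned
print(chunks_read(after, 40_000, 41_000))   # → 1: nine of ten chunks skipped
```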

Vectorized Execution

colcompress ships a vectorized expression evaluation engine that processes WHERE clauses and aggregates in column-oriented batches of up to 10,000 values per call. This eliminates per-row interpreter overhead and maps naturally onto column chunks.

Supported vectorized operations:

  • Comparison operators (=, <>, <, <=, >, >=): int2, int4, int8, float4, float8, date, timestamp, timestamptz, char, bpchar, text, varchar, name, bool, oid
  • Aggregates (count, sum, avg, max, min): int2, int4, int8, float8, date, timestamptz

Vectorized execution and the column cache are on by default and can be toggled per session:

SET storage_engine.enable_vectorization = on;
SET storage_engine.enable_column_cache   = on;

Parallel Scan

The AM implements the full PostgreSQL parallel Table AM protocol via Dynamic Shared Memory (DSM). The coordinator divides the stripe range across workers; each worker independently reads and decompresses its assigned stripes. Parallel scan stacks on top of vectorized execution — each worker runs its own vectorized pipeline.

SET storage_engine.enable_parallel_execution = on;
SET storage_engine.min_parallel_processes    = 8;

-- Standard PostgreSQL parallel knobs also apply
SET max_parallel_workers_per_gather = 4;
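
The coordinator/worker split can be sketched with plain Python threads (illustrative only — the real implementation uses DSM and the Table AM parallel-scan callbacks):

```python
import threading
from queue import Queue, Empty

def parallel_scan(stripe_ids, workers=4):
    """Coordinator puts stripe IDs on a queue; workers claim them one at a time."""
    todo = Queue()
    for sid in stripe_ids:          # in the extension, only stripes that survive pruning land here
        todo.put(sid)
    scanned, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                sid = todo.get_nowait()  # atomic claim, analogous to rowcompress batch claiming
            except Empty:
                return
            with lock:
                scanned.append(sid)      # stand-in for "read and decompress this stripe"

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return scanned

print(sorted(parallel_scan(range(1, 8))))  # → [1, 2, 3, 4, 5, 6, 7]: each stripe scanned exactly once
```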

Parallel scan and stripe pruning (v1.0.6+): Stripe pruning is active in both the sequential and parallel scan paths. The coordinator pre-filters stripe IDs against the query predicate before distributing work to workers, so each worker receives only stripes that survive the min/max test — no wasted I/O in parallel mode. This is confirmed by the benchmarks: Q5 (date-range, 1 month) achieves 22.4ms in serial and 28.2ms in parallel — both with 6 of 7 stripes eliminated by pruning.

DELETE, UPDATE, and Upserts

colcompress implements DELETE and UPDATE via a row mask stored in engine.row_mask. Each deleted row is recorded as a bit in a per-chunk-group bitmask; the scan engine skips masked rows without rewriting the stripe. UPDATE is delete-then-insert. Deleted rows are reclaimed during VACUUM, which rewrites affected stripes and clears the mask.

SET storage_engine.enable_dml = on;  -- default: on

DELETE FROM events WHERE ts < now() - interval '1 year';
UPDATE events SET value = value * 1.1 WHERE event_type = 'purchase';

-- Standard upserts also work, with a unique index on the conflict target
INSERT INTO events (ts, user_id, event_type, value)
VALUES (now(), 42, 'click', 1.0)
ON CONFLICT (user_id, event_type) DO UPDATE SET value = EXCLUDED.value;
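
The row-mask mechanism is simple to model — a sketch with hypothetical names (the real mask lives in engine.row_mask as catalog data):

```python
class ChunkGroup:
    def __init__(self, rows):
        self.rows = rows
        self.mask = bytearray((len(rows) + 7) // 8)  # one bit per row, like engine.row_mask

    def delete_where(self, pred):
        """DELETE only marks bits; the compressed stripe is never rewritten."""
        for i, row in enumerate(self.rows):
            if pred(row):
                self.mask[i // 8] |= 1 << (i % 8)

    def scan(self):
        """The scan engine skips masked rows."""
        for i, row in enumerate(self.rows):
            if not (self.mask[i // 8] >> (i % 8)) & 1:
                yield row

chunk = ChunkGroup([{"user_id": u, "value": u * 10} for u in range(5)])
chunk.delete_where(lambda r: r["user_id"] < 2)
print([r["user_id"] for r in chunk.scan()])  # → [2, 3, 4]
```

VACUUM then corresponds to rewriting the stripe without the masked rows and clearing the mask.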

Index-Backed Scan

For document repositories — tables storing XML, PDF, JSON blobs that need columnar compression but are fetched by primary key — an index scan path is available:

-- Enable per table (persisted across reconnects)
SELECT engine.alter_colcompress_table_set('documents'::regclass, index_scan => true);

-- Or per session for all colcompress tables
SET storage_engine.enable_columnar_index_scan = on;

The combination of zstd compression (3–10× ratio on large binary/text documents) and index-driven point lookups is compelling for document storage use cases.

Do not create B-tree indexes on columns covered by the orderby key on analytical tables. The PostgreSQL planner may prefer IndexScan for range queries, which bypasses stripe pruning entirely (randomAccess=true). Use GIN indexes for JSONB and array columns, and rely on stripe pruning for range predicates.


rowcompress: Batch-Compressed Row Storage

rowcompress stores rows in fixed-size batches (default 10,000 rows per batch). Each batch is serialized using heap tuple format and compressed as a single unit. Batch metadata — file offset, byte size, first row number, row count — is stored in engine.row_batch.

This AM suits append-heavy workloads where compression matters but column projection is not needed: event logs, audit trails, time-series tables where many or all columns are queried together. Typical storage savings are 2–10× with zstd.
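
The batch layout can be sketched in a few lines of Python (zlib and JSON stand in for zstd and PostgreSQL's heap tuple format; the metadata fields mirror what engine.row_batch records):

```python
import io
import json
import zlib

BATCH_ROWS = 10_000  # default batch size

def write_batches(rows, out: io.BytesIO):
    """Compress rows in fixed-size batches; return per-batch metadata."""
    meta = []
    for first in range(0, len(rows), BATCH_ROWS):
        batch = rows[first:first + BATCH_ROWS]
        blob = zlib.compress(json.dumps(batch).encode())
        meta.append({"offset": out.tell(), "size": len(blob),
                     "first_row": first + 1, "row_count": len(batch)})
        out.write(blob)
    return meta

def read_batch(buf: io.BytesIO, m):
    """A batch is the unit of decompression: seek, read, inflate as one piece."""
    buf.seek(m["offset"])
    return json.loads(zlib.decompress(buf.read(m["size"])))

rows = [[i, f"msg {i % 7}"] for i in range(25_000)]
buf = io.BytesIO()
meta = write_batches(rows, buf)
print(len(meta), meta[1]["first_row"], meta[2]["row_count"])  # → 3 10001 5000
```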

Compared to colcompress:

  • Reads full rows (no column projection)
  • Lower write latency per row (no columnar transposition at write time)
  • No vectorized execution or chunk-level pruning
  • Parallel reads via atomic batch claiming (workers self-schedule, zero coordinator overhead)
  • Full compression algorithm support

SELECT engine.alter_rowcompress_table_set(
    'audit_log'::regclass,
    batch_size        => 10000,
    compression       => 'zstd',
    compression_level => 5
);

-- Rewrite all batches after changing compression options
SELECT engine.rowcompress_repack('audit_log');

Benchmarks

Benchmark suite: 1,000,000 rows, PostgreSQL 18.3, AMD Ryzen 7 5800H (8-core), 40 GB RAM, shared_buffers=10GB. colcompress configured with lz4 compression and orderby = 'event_date ASC' (globally sorted via colcompress_merge). Results are the median of 3 runs.

Serial (JIT=off, max_parallel_workers_per_gather=0)


Query heap colcompress rowcompress citus_columnar
Q1 count(*) 39.8ms 43.0ms 313ms 36.6ms
Q2 SUM/AVG numeric + double 188.6ms 117.4ms 358ms 122.9ms
Q3 GROUP BY country (10 vals) 219.0ms 162.0ms 395ms 139.4ms
Q4 GROUP BY event_type + p95 538.9ms 448.4ms 685ms 469.7ms
Q5 date range 1 month 20.8ms 22.4ms 59.1ms 20.6ms
Q6 JSONB @> GIN 123.1ms 162.2ms 326ms 238.1ms
Q7 JSONB key + GROUP BY 388.5ms 310.3ms 550ms 358.2ms
Q8 array @> GIN 63.0ms 122.7ms 274ms 140.9ms
Q9 LIKE text scan 150.9ms 90.9ms 338ms 89.9ms
Q10 heavy multi-agg 1953ms 1939ms 2109ms 1925ms

Q5 on colcompress achieves heap-equivalent performance (22.4ms vs. 20.8ms) because stripe pruning eliminates 6 of 7 stripes — data is physically sorted by event_date via orderby. lz4 decompression adds negligible overhead over the pruned data.

Q6 and Q8 (GIN index on JSONB/arrays) benefit from the GIN index without needing stripe pruning. This is expected behavior.

Parallel (JIT=on, max_parallel_workers_per_gather=16)


Query heap colcompress rowcompress citus_columnar
Q1 count(*) 18.3ms 16.4ms 148ms 37.9ms
Q2 SUM/AVG numeric + double 53.5ms 29.7ms 166ms 121.5ms
Q3 GROUP BY country 61.6ms 166ms 161ms 143ms
Q4 GROUP BY event_type + p95 540ms 316ms 674ms 470ms
Q5 date range 1 month 21.4ms 28.2ms 73.3ms 21.1ms
Q6 JSONB @> GIN 84.3ms 40.4ms 490ms 245ms
Q7 JSONB key + GROUP BY 392ms 65.7ms 687ms 362ms
Q8 array @> GIN 61.6ms 32.7ms 273ms 146ms
Q9 LIKE text scan 48.7ms 25.4ms 157ms 91.7ms
Q10 heavy multi-agg 1903ms 641ms 2085ms 1920ms

Q5 on colcompress (28.2ms) remains close to the serial result (22.4ms) and heap (21.4ms). Since v1.0.6, the planner correctly applies disable_cost to parallel index paths, ensuring Parallel Custom Scan (ColcompressScan) with stripe pruning is chosen over Parallel Index Scan when index_scan=false.

The full benchmark kit is in tests/bench/. See BENCHMARKS.md for environment details and reproduction steps.


Known Limitations for DBAs

These are the limitations most relevant to production deployment decisions:

No AFTER ROW triggers or foreign keys. This is an architectural constraint of columnar storage. It also means pg_repack cannot be used — it relies on AFTER ROW triggers internally. Use engine.colcompress_repack() as a drop-in replacement.

No VACUUM FULL / table rewrite. Use engine.colcompress_repack() / engine.rowcompress_repack() instead.

No CLUSTER support. Use engine.colcompress_merge() with an orderby option to achieve equivalent physical ordering.

No unlogged tables. CREATE UNLOGGED TABLE ... USING colcompress is not supported.

Sort-on-write is disabled when B-tree indexes exist. Run engine.colcompress_merge() after bulk loads to re-establish global sort order.

colcompress_repack is not online. It acquires AccessExclusiveLock for the full duration. Schedule during maintenance windows.

AFTER STATEMENT triggers are supported. Only row-level (FOR EACH ROW) AFTER triggers are blocked.


Management Reference

Function Description
engine.alter_colcompress_table_set(regclass, ...) Set options on a colcompress table
engine.alter_colcompress_table_reset(regclass, ...) Reset colcompress options to system defaults
engine.colcompress_merge(regclass) Rewrite and globally sort a colcompress table by its orderby key
engine.colcompress_repack(regclass) Alias for colcompress_merge; drop-in replacement for pg_repack
engine.alter_rowcompress_table_set(regclass, ...) Set options on a rowcompress table
engine.alter_rowcompress_table_reset(regclass, ...) Reset rowcompress options to system defaults
engine.rowcompress_repack(regclass) Rewrite all batches with current options

Catalog views for inspection:

View Description
engine.colcompress_options Per-table options for all colcompress tables
engine.colcompress_stripes Stripe-level metadata (offset, size, row range) per table
engine.rowcompress_options Per-table options for all rowcompress tables
engine.rowcompress_batches Batch-level metadata for all rowcompress tables

All views grant SELECT to PUBLIC.


Installation

Build from source

Requires PostgreSQL server headers and pg_config in PATH. Supports PostgreSQL 13–18.

cd dist/
sudo make -j$(nproc) install

Add to postgresql.conf:

shared_preload_libraries = 'storage_engine'

If citus or pg_cron are also in shared_preload_libraries, load order matters — citus must appear before storage_engine:

shared_preload_libraries = 'pg_cron,citus,storage_engine'

Then:

CREATE EXTENSION storage_engine;

License

AGPL-3.0. This is relevant for organizations that distribute modified versions or embed the extension in SaaS products — review the license terms accordingly.


Links

Feedback, issues, and pull requests are welcome on GitHub.


Saulo José Benvenutti — Data Architect / PostgreSQL DBA
📧 saulojb@gmail.com · 🔗 github.com/saulojb
