After working with PostgreSQL's Table Access Method (TAM) API introduced in version 12, I built storage_engine — a PostgreSQL extension that ships two specialized storage engines as first-class AMs: colcompress for column-oriented analytics and rowcompress for append-heavy compressed workloads. Both coexist with standard heap tables in the same database, and both are available on PGXN.
This article is aimed at DBAs and database architects evaluating columnar or compressed storage for analytical and HTAP workloads without leaving the PostgreSQL ecosystem.
Lineage and Honest Attribution
storage_engine is a fork of Hydra Columnar, which is itself derived from citus_columnar — originally built by Citus Data, now part of Microsoft. The original code is copyright Citus Data / Hydra, licensed under AGPL-3.0. I extend those foundations with rowcompress, full DELETE/UPDATE support, MergeTree-style ordering, two-level zone-map pruning, and a redesigned parallel scan.
Ideas borrowed from other systems are acknowledged explicitly:
- ClickHouse MergeTree: per-table orderby sort key and stripe/chunk-level zone-map pruning
- Apache Parquet: row-group statistics, column projection, dictionary encoding
- DuckDB: vectorized expression evaluation for columnar batches
I mention this upfront because the PostgreSQL community values transparency, and because it helps you understand what is new versus what is inherited.
Why Two Access Methods in One Extension?
Most workloads are not purely analytical or purely transactional. A common HTAP pattern involves:
- Wide fact tables updated in bulk and queried with selective column projections → colcompress
- High-volume append-only logs, audit trails, or time-series where all columns are read together → rowcompress
Having both AMs in a single extension means a single CREATE EXTENSION storage_engine installs both, all catalog objects land in the engine schema, and C symbols carry the se_ prefix to avoid conflicts with citus_columnar or any other columnar extension you may already have loaded.
CREATE EXTENSION storage_engine;
-- Column-oriented analytics table
CREATE TABLE events (
ts timestamptz NOT NULL,
user_id bigint,
event_type text,
value float8
) USING colcompress;
-- Row-oriented compressed log table
CREATE TABLE audit_log (
id bigserial,
logged_at timestamptz NOT NULL,
message text
) USING rowcompress;
Both tables live alongside heap tables in the same database. Joins between them and heap tables work normally through the standard PostgreSQL executor.
colcompress: Column-Oriented Storage with Vectorized Execution
Storage Layout
Data is stored column by column. Each column is divided into stripes (default 150,000 rows), and each stripe is subdivided into chunk groups (default 10,000 rows). Within each chunk group, every column chunk records its minimum and maximum values:
Table file
├── Stripe 1 (rows 1 – 150,000)
│ ├── Chunk group 0 (rows 1 – 10,000)
│ │ ├── Column A [min, max, compressed values…]
│ │ ├── Column B [min, max, compressed values…]
│ │ └── …
│ └── Chunk group 1 (rows 10,001 – 20,000) …
└── Stripe 2 (rows 150,001 – 300,000) …
A scan reads only the columns referenced by the query. On a 30-column table where a query touches 4 columns, the I/O reduction is roughly 87%. This is the fundamental advantage of columnar storage for analytics.
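The arithmetic behind that figure is a back-of-the-envelope model, not engine code: assuming roughly uniform column widths, reading 4 of 30 columns touches about 13% of the table's bytes.

```python
def projection_io_fraction(columns_read: int, total_columns: int) -> float:
    """Fraction of table bytes a columnar scan must read, under the
    simplifying assumption of roughly uniform column widths."""
    return columns_read / total_columns

saved = 1 - projection_io_fraction(4, 30)
print(f"I/O reduction: {saved:.0%}")  # → I/O reduction: 87%
```

In practice the saving depends on per-column widths and compression ratios, but the order of magnitude holds for wide fact tables.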
Compression Algorithms
Each chunk group is compressed independently. Available algorithms:
| Name | Description |
|---|---|
| none | No compression |
| lz4 | Fast compression/decompression, moderate ratio |
| zstd | High ratio, configurable level 1–19 (default: 3) |
| pglz | PostgreSQL's built-in LZ variant |
SELECT engine.alter_colcompress_table_set('events'::regclass,
compression => 'zstd',
compression_level => 9);
Two-Level Zone-Map Pruning
This is one of the most impactful features for DBAs managing large time-series or event tables.
Stripe-level pruning (coarse) — Before reading any data, the scan aggregates min/max across all chunks of each stripe and tests the resulting stripe-wide range against the query's WHERE predicates using PostgreSQL's predicate_refuted_by. Stripes provably disjoint from the predicate are skipped entirely — no decompression, no I/O. EXPLAIN reports this directly:
Custom Scan (ColumnarScan) on events
Engine Stripes Removed by Pruning: 41
Engine Stripes Read: 12
Chunk-level pruning (fine) — Within each stripe that survives the coarse pass, individual chunk groups are tested against the same predicate. Chunk groups whose min/max range cannot satisfy the predicate are skipped.
The two layers compose. On a large, well-sorted table, a date-range query eliminates entire stripes before touching them, then further prunes chunk groups within the survivors. The practical result is scan I/O comparable to an index scan, without the cost of maintaining a B-tree.
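The layering can be sketched in a few lines of Python. This is an illustrative model of the min/max test only — the extension's C implementation uses PostgreSQL's predicate_refuted_by over real WHERE clauses:

```python
# Illustrative model of two-level zone-map pruning (not the C implementation).
# A zone [zmin, zmax] can satisfy "value BETWEEN lo AND hi" only if the two
# ranges overlap; disjoint zones are provably empty and are skipped.

def overlaps(zone, pred):
    (zmin, zmax), (lo, hi) = zone, pred
    return zmax >= lo and zmin <= hi

def prune(stripes, pred):
    """stripes: list of stripes, each a list of per-chunk (min, max) zones."""
    survivors = []
    for chunks in stripes:
        # Coarse pass: the stripe-wide range is the union of its chunk zones.
        stripe_zone = (min(c[0] for c in chunks), max(c[1] for c in chunks))
        if not overlaps(stripe_zone, pred):
            continue  # whole stripe skipped: no I/O, no decompression
        # Fine pass: test each surviving chunk group individually.
        survivors.append([c for c in chunks if overlaps(c, pred)])
    return survivors

stripes = [[(1, 10), (11, 20)], [(21, 30), (31, 40)], [(41, 50), (51, 60)]]
print(prune(stripes, (25, 35)))  # → [[(21, 30), (31, 40)]]
```

Only the middle stripe survives the coarse pass; the fine pass then keeps both of its chunk groups because each overlaps the predicate range.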
Pruning effectiveness scales with data sortedness. This brings us to the MergeTree-inspired ordering mechanism.
MergeTree-Like Ordering
-- Define a global sort key
SELECT engine.alter_colcompress_table_set(
'events'::regclass,
orderby => 'ts ASC, user_id ASC'
);
-- After a bulk load, compact and globally sort
SELECT engine.colcompress_merge('events');
colcompress_merge copies all live rows to a temporary heap table, truncates the target, and re-inserts rows in the defined order — writing fresh, globally ordered stripes. After this, WHERE ts BETWEEN x AND y skips almost all chunks on typical time-series data.
DBA note: colcompress_merge acquires AccessExclusiveLock for the duration of the operation. Schedule it during a maintenance window for large tables. There is no online/concurrent mode.
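Why the global sort pays off for pruning can be shown with a toy model (illustrative Python, not engine code): after sorting, each chunk's [min, max] zone becomes narrow and disjoint, so far more chunks can be refuted by a range predicate.

```python
# Illustrative sketch: global sorting tightens per-chunk zone maps.
# Chunk size is tiny here for readability; the engine uses 10,000 rows.

def chunk_zones(rows, chunk_size):
    """Per-chunk (min, max) zone maps over a single column of values."""
    return [(min(c), max(c)) for c in
            (rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size))]

unsorted = [5, 99, 12, 80, 3, 77, 41, 8]           # interleaved bulk loads
print(chunk_zones(unsorted, 4))                    # wide, overlapping zones
print(chunk_zones(sorted(unsorted), 4))            # tight, disjoint zones
```

With the unsorted layout nearly every chunk overlaps any mid-range predicate; after sorting, a predicate such as BETWEEN 70 AND 90 refutes the first chunk outright.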
Vectorized Execution
colcompress ships a vectorized expression evaluation engine that processes WHERE clauses and aggregates in column-oriented batches of up to 10,000 values per call. This eliminates per-row interpreter overhead and maps naturally onto column chunks.
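A minimal sketch of the idea (illustrative Python, not the extension's C evaluator): the comparison runs once over a whole batch and produces a selection mask, rather than dispatching the operator through the expression interpreter for every row.

```python
# Illustrative sketch of vectorized predicate evaluation over a column batch.
# The real engine does this in C directly over decompressed chunk data.

BATCH_SIZE = 10_000  # matches the chunk-group row count

def vectorized_ge(batch, threshold):
    """One call evaluates 'col >= threshold' for every value in the batch,
    returning a selection mask; no per-row expression dispatch."""
    return [v >= threshold for v in batch]

batch = [3, 17, 8, 42, 25]
print(vectorized_ge(batch, 10))  # → [False, True, False, True, True]
```

Downstream operators (aggregates, projections) consume the mask, which is why vectorization composes naturally with the chunked columnar layout.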
Supported vectorized operations:
| Category | Types |
|---|---|
| Comparison operators (=, <>, <, <=, >, >=) | int2, int4, int8, float4, float8, date, timestamp, timestamptz, char, bpchar, text, varchar, name, bool, oid |
| Aggregates (count, sum, avg, max, min) | int2, int4, int8, float8, date, timestamptz |
Both features are on by default and can be toggled per session:
SET storage_engine.enable_vectorization = on;
SET storage_engine.enable_column_cache = on;
Parallel Scan
The AM implements the full PostgreSQL parallel Table AM protocol via Dynamic Shared Memory (DSM). The coordinator divides the stripe range across workers; each worker independently reads and decompresses its assigned stripes. Parallel scan stacks on top of vectorized execution — each worker runs its own vectorized pipeline.
SET storage_engine.enable_parallel_execution = on;
SET storage_engine.min_parallel_processes = 8;
-- Standard PostgreSQL parallel knobs also apply
SET max_parallel_workers_per_gather = 4;
Parallel scan and stripe pruning (v1.0.6+): Stripe pruning is active in both the sequential and parallel scan paths. The coordinator pre-filters stripe IDs against the query predicate before distributing work to workers, so each worker receives only stripes that survive the min/max test — no wasted I/O in parallel mode. This is confirmed by the benchmarks: Q5 (date-range, 1 month) achieves 22.4ms in serial and 28.2ms in parallel — both with 6 of 7 stripes eliminated by pruning.
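The coordinator-side flow described above can be sketched as follows. This is an illustrative model, not the DSM code, and the round-robin assignment is an assumption for the sketch — the actual distribution policy may differ:

```python
# Illustrative sketch: the coordinator prunes stripes against the predicate
# first, then distributes only the survivors to parallel workers.

def plan_parallel_scan(stripe_zones, pred, n_workers):
    lo, hi = pred
    surviving = [sid for sid, (zmin, zmax) in enumerate(stripe_zones)
                 if zmax >= lo and zmin <= hi]      # coordinator-side pruning
    # Round-robin assignment (an assumption for this sketch); each worker
    # reads and decompresses its assigned stripes independently.
    return [surviving[w::n_workers] for w in range(n_workers)]

zones = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
print(plan_parallel_scan(zones, (15, 34), n_workers=2))  # → [[1, 3], [2]]
```

Only stripes 1–3 overlap the predicate, so the four pruned stripes are never handed to any worker — the property the Q5 benchmark confirms.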
DELETE, UPDATE, and Upserts
colcompress implements DELETE and UPDATE via a row mask stored in engine.row_mask. Each deleted row is recorded as a bit in a per-chunk-group bitmask; the scan engine skips masked rows without rewriting the stripe. UPDATE is delete-then-insert. Deleted rows are reclaimed during VACUUM, which rewrites affected stripes and clears the mask.
SET storage_engine.enable_dml = on; -- default: on
DELETE FROM events WHERE ts < now() - interval '1 year';
UPDATE events SET value = value * 1.1 WHERE event_type = 'purchase';
-- Standard upserts also work, with a unique index on the conflict target
INSERT INTO events (ts, user_id, event_type, value)
VALUES (now(), 42, 'click', 1.0)
ON CONFLICT (user_id, event_type) DO UPDATE SET value = EXCLUDED.value;
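The row-mask mechanism can be modeled compactly. A hypothetical sketch in Python of the per-chunk-group bitmask (the real mask lives in the engine.row_mask catalog and is maintained in C):

```python
# Illustrative model of the per-chunk-group delete bitmask: deleted rows are
# marked one bit each, and the scan filters them without rewriting the stripe.

class ChunkGroupMask:
    def __init__(self, nrows):
        self.bits = bytearray((nrows + 7) // 8)  # one bit per row

    def mark_deleted(self, row):
        self.bits[row // 8] |= 1 << (row % 8)

    def is_deleted(self, row):
        return bool(self.bits[row // 8] & (1 << (row % 8)))

mask = ChunkGroupMask(10_000)
mask.mark_deleted(42)                     # DELETE touches only the mask
visible = [r for r in (41, 42, 43) if not mask.is_deleted(r)]
print(visible)  # → [41, 43]
```

This is why DELETE is cheap (a bit flip) while space is only reclaimed later, when VACUUM rewrites the affected stripes and clears the mask.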
Index-Backed Scan
For document repositories — tables storing XML, PDF, JSON blobs that need columnar compression but are fetched by primary key — an index scan path is available:
-- Enable per table (persisted across reconnects)
SELECT engine.alter_colcompress_table_set('documents'::regclass, index_scan => true);
-- Or per session for all colcompress tables
SET storage_engine.enable_columnar_index_scan = on;
The combination of zstd compression (3–10× ratio on large binary/text documents) and index-driven point lookups is compelling for document storage use cases.
Do not create B-tree indexes on columns covered by the orderby key on analytical tables. The PostgreSQL planner may prefer IndexScan for range queries, which bypasses stripe pruning entirely (randomAccess=true). Use GIN indexes for JSONB and array columns, and rely on stripe pruning for range predicates.
rowcompress: Batch-Compressed Row Storage
rowcompress stores rows in fixed-size batches (default 10,000 rows per batch). Each batch is serialized using heap tuple format and compressed as a single unit. Batch metadata — file offset, byte size, first row number, row count — is stored in engine.row_batch.
This AM suits append-heavy workloads where compression matters but column projection is not needed: event logs, audit trails, time-series tables where many or all columns are queried together. Typical storage savings are 2–10× with zstd.
Compared to colcompress:
- Reads full rows (no column projection)
- Lower write latency per row (no columnar transposition at write time)
- No vectorized execution or chunk-level pruning
- Parallel reads via atomic batch claiming (workers self-schedule, zero coordinator overhead)
- Full compression algorithm support
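The atomic batch-claiming scheme mentioned above can be sketched like this. The extension uses an atomic fetch-and-add in dynamic shared memory; in this illustrative Python model a lock stands in for the atomic, and the worker loop is hypothetical:

```python
# Illustrative sketch of worker self-scheduling over compressed row batches.
import threading

class BatchClaimer:
    def __init__(self, n_batches):
        self.next_batch = 0
        self.n_batches = n_batches
        self._lock = threading.Lock()  # stand-in for an atomic fetch-and-add

    def claim(self):
        """Each call hands out the next unread batch exactly once."""
        with self._lock:
            if self.next_batch >= self.n_batches:
                return None  # nothing left; the worker exits
            b = self.next_batch
            self.next_batch += 1
            return b

claimer = BatchClaimer(n_batches=5)
claimed = []

def worker():
    while (b := claimer.claim()) is not None:
        claimed.append(b)  # decompress and scan batch b

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(claimed))  # → [0, 1, 2, 3, 4]
```

Because workers pull work instead of being assigned it, there is no coordinator in the read path and no skew when batches decompress at different speeds.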
SELECT engine.alter_rowcompress_table_set(
'audit_log'::regclass,
batch_size => 10000,
compression => 'zstd',
compression_level => 5
);
-- Rewrite all batches after changing compression options
SELECT engine.rowcompress_repack('audit_log');
Benchmarks
Benchmark suite: 1,000,000 rows, PostgreSQL 18.3, AMD Ryzen 7 5800H (8-core), 40 GB RAM, shared_buffers=10GB. colcompress configured with lz4 compression and orderby = 'event_date ASC' (globally sorted via colcompress_merge). Results are the median of 3 runs.
Serial (JIT=off, max_parallel_workers_per_gather=0)
| Query | heap | colcompress | rowcompress | citus_columnar |
|---|---|---|---|---|
| Q1 count(*) | 39.8ms | 43.0ms | 313ms | 36.6ms |
| Q2 SUM/AVG numeric + double | 188.6ms | 117.4ms | 358ms | 122.9ms |
| Q3 GROUP BY country (10 vals) | 219.0ms | 162.0ms | 395ms | 139.4ms |
| Q4 GROUP BY event_type + p95 | 538.9ms | 448.4ms | 685ms | 469.7ms |
| Q5 date range 1 month | 20.8ms | 22.4ms | 59.1ms | 20.6ms |
| Q6 JSONB @> GIN | 123.1ms | 162.2ms | 326ms | 238.1ms |
| Q7 JSONB key + GROUP BY | 388.5ms | 310.3ms | 550ms | 358.2ms |
| Q8 array @> GIN | 63.0ms | 122.7ms | 274ms | 140.9ms |
| Q9 LIKE text scan | 150.9ms | 90.9ms | 338ms | 89.9ms |
| Q10 heavy multi-agg | 1953ms | 1939ms | 2109ms | 1925ms |
Q5 on colcompress achieves heap-equivalent performance (22.4ms vs. 20.8ms) because stripe pruning eliminates 6 of 7 stripes — data is physically sorted by event_date via orderby. lz4 decompression adds negligible overhead over the pruned data.
Q6 and Q8 (GIN index on JSONB/arrays) benefit from the GIN index without needing stripe pruning. This is expected behavior.
Parallel (JIT=on, max_parallel_workers_per_gather=16)
| Query | heap | colcompress | rowcompress | citus_columnar |
|---|---|---|---|---|
| Q1 count(*) | 18.3ms | 16.4ms | 148ms | 37.9ms |
| Q2 SUM/AVG numeric + double | 53.5ms | 29.7ms | 166ms | 121.5ms |
| Q3 GROUP BY country | 61.6ms | 166ms | 161ms | 143ms |
| Q4 GROUP BY event_type + p95 | 540ms | 316ms | 674ms | 470ms |
| Q5 date range 1 month | 21.4ms | 28.2ms | 73.3ms | 21.1ms |
| Q6 JSONB @> GIN | 84.3ms | 40.4ms | 490ms | 245ms |
| Q7 JSONB key + GROUP BY | 392ms | 65.7ms | 687ms | 362ms |
| Q8 array @> GIN | 61.6ms | 32.7ms | 273ms | 146ms |
| Q9 LIKE text scan | 48.7ms | 25.4ms | 157ms | 91.7ms |
| Q10 heavy multi-agg | 1903ms | 641ms | 2085ms | 1920ms |
Q5 on colcompress (28.2ms) is now on par with the serial result (22.4ms) and heap (21.4ms). Since v1.0.6, the planner correctly applies disable_cost to parallel index paths, ensuring Parallel Custom Scan (ColcompressScan) with stripe pruning is chosen over Parallel Index Scan when index_scan=false.
The full benchmark kit is in tests/bench/. See BENCHMARKS.md for environment details and reproduction steps.
Known Limitations for DBAs
These are the limitations most relevant to production deployment decisions:
No AFTER ROW triggers or foreign keys. This is an architectural constraint of columnar storage. It also means pg_repack cannot be used — it relies on AFTER ROW triggers internally. Use engine.colcompress_repack() as a drop-in replacement.
No VACUUM FULL / table rewrite. Use engine.colcompress_repack() / engine.rowcompress_repack() instead.
No CLUSTER support. Use engine.colcompress_merge() with an orderby option to achieve equivalent physical ordering.
No unlogged tables. CREATE UNLOGGED TABLE ... USING colcompress is not supported.
Sort-on-write is disabled when B-tree indexes exist. Run engine.colcompress_merge() after bulk loads to re-establish global sort order.
colcompress_repack is not online. It acquires AccessExclusiveLock for the full duration. Schedule during maintenance windows.
AFTER STATEMENT triggers are supported. Only row-level (FOR EACH ROW) AFTER triggers are blocked.
Management Reference
| Function | Description |
|---|---|
| engine.alter_colcompress_table_set(regclass, ...) | Set options on a colcompress table |
| engine.alter_colcompress_table_reset(regclass, ...) | Reset colcompress options to system defaults |
| engine.colcompress_merge(regclass) | Rewrite and globally sort a colcompress table by its orderby key |
| engine.colcompress_repack(regclass) | Alias for colcompress_merge; drop-in replacement for pg_repack |
| engine.alter_rowcompress_table_set(regclass, ...) | Set options on a rowcompress table |
| engine.alter_rowcompress_table_reset(regclass, ...) | Reset rowcompress options to system defaults |
| engine.rowcompress_repack(regclass) | Rewrite all batches with current options |
Catalog views for inspection:
| View | Description |
|---|---|
| engine.colcompress_options | Per-table options for all colcompress tables |
| engine.colcompress_stripes | Stripe-level metadata (offset, size, row range) per table |
| engine.rowcompress_options | Per-table options for all rowcompress tables |
| engine.rowcompress_batches | Batch-level metadata for all rowcompress tables |
All views grant SELECT to PUBLIC.
Installation
Build from source
Requires PostgreSQL server headers and pg_config in PATH. Supports PostgreSQL 13–18.
cd dist/
sudo make -j$(nproc) install
Add to postgresql.conf:
shared_preload_libraries = 'storage_engine'
If citus or pg_cron are also in shared_preload_libraries, load order matters — citus must appear before storage_engine:
shared_preload_libraries = 'pg_cron,citus,storage_engine'
Then:
CREATE EXTENSION storage_engine;
License
AGPL-3.0. This is relevant for organizations that distribute modified versions or embed the extension in SaaS products — review the license terms accordingly.
Links
- GitHub: https://github.com/saulojb/storage_engine
- PGXN: https://pgxn.org/ (search: storage_engine)
- BENCHMARKS.md: https://github.com/saulojb/storage_engine/blob/main/BENCHMARKS.md
- Hydra Columnar (upstream): https://github.com/hydradatabase/hydra
Feedback, issues, and pull requests are welcome on GitHub.
Saulo José Benvenutti — Data Architect / PostgreSQL DBA
📧 saulojb@gmail.com · 🔗 github.com/saulojb