Team Tiger Data for Tiger Data (Creators of TimescaleDB)

Posted on Jun 9 • Originally published at tigerdata.com on Jun 5

Row vs Columnar Storage for Analytics: Why PostgreSQL Scans Are Slower Than They Should Be

#postgres #realtimeanalytics #storage #analytics

Here's a query that runs on most time-series tables:

SELECT time_bucket('1 hour', ts) AS hour,
       avg(temperature),
       max(temperature)
FROM sensor_readings
WHERE ts > now() - interval '7 days'
GROUP BY hour
ORDER BY hour;

The query needs two columns: ts and temperature. The table has 15 columns. Postgres reads all 15 columns for every row that matches the WHERE clause.

That's not a bug. It's how row-oriented storage works. Each row is stored as a contiguous block of bytes on disk, called a heap tuple, and Postgres reads the entire tuple to access any column within it. For point lookups on individual records, this is efficient. You want the whole row, and it's stored together. For analytical scans over millions of rows where you need two columns out of fifteen, it's the dominant source of wasted I/O.

In Understanding Postgres Performance Limits for Analytics on Live Data, row-oriented storage was identified as one of four architectural constraints that compound under high-frequency ingestion. That whitepaper maps the pattern at a system level. This post goes deeper on the physical mechanism: exactly how pages work, how read amplification accumulates, and why the usual fixes don't reach it.

What You Will Learn

By the end of this post, you'll have a concrete diagnostic formula: the read amplification ratio. It tells you whether your storage layout is the dominant I/O bottleneck for analytical queries on any table you own. You'll also understand why indexes can't fix this class of problem and how a hybrid row-columnar storage layout changes the math. This post assumes working familiarity with Postgres page layout and B-tree indexes.

How Row Storage Actually Works in Postgres

Postgres stores data in 8KB pages. Each page holds multiple heap tuples. Each tuple contains every column value for that row, stored sequentially, preceded by a 23-byte header that carries transaction visibility metadata.

A table with 15 columns averaging 200 bytes per row fits roughly 35 to 40 rows per page, after accounting for headers, alignment padding, and page overhead.

When Postgres runs a sequential scan, it reads pages from disk in order. Each page load brings all the rows on that page into shared_buffers, with all 15 columns per row intact. The executor then evaluates the WHERE clause and pulls the needed columns from what was already loaded into memory.

The I/O cost is proportional to total table size, not to the size of the queried columns. A query that needs 12 bytes of data per row still reads 200 bytes from disk. The remaining 188 bytes load into the buffer cache and get discarded.

The Read Amplification Math

The number that makes this concrete is the read amplification ratio: total row width divided by the width of the columns the query actually needs.

For sensor_readings, the calculation is direct. The ts column is a timestamptz at 8 bytes. The temperature column is a float4 at 4 bytes. Together they represent 12 bytes of useful data per row. The full row is 200 bytes.

Read amplification ratio: 200 ÷ 12 = 16.7x

For every byte the query uses, Postgres reads 16.7 bytes from disk.

At 100 million rows covering seven days, that ratio stops being abstract. The query needs 100M x 12 bytes = 1.14 GB. Postgres reads 100M x 200 bytes = 18.6 GB. At a 500 MB/sec sequential read rate, the scan takes approximately 38 seconds. Reading only the needed columns would take roughly 2.3 seconds. That 16x gap is pure storage model overhead.

No index changes this number. No configuration setting changes it. Partitioning reduces scope. Fewer pages get scanned by cutting the time range, but within each partition the same per-row read cost applies. The storage layout determines the I/O, and the storage layout is fixed.

Try This Now: Measure Your Read Amplification

You can calculate the ratio for any table you own. Run these two queries to get the byte widths you need:

-- Full row weight
SELECT pg_column_size(t.*) AS row_bytes
FROM sensor_readings t
LIMIT 1;

-- Queried column weight
SELECT pg_column_size(ts) + pg_column_size(temperature) AS queried_bytes
FROM sensor_readings
LIMIT 1;

Divide row_bytes by queried_bytes. If the ratio is above 5x, the storage model is your largest I/O bottleneck for analytical queries on that table. No index or configuration change will close that gap.

Why Indexes Don’t Solve This

When a query is slow, the instinctive response is to add an index. For OLTP workloads, that instinct is correct. B-tree indexes excel at row selection: they find specific rows in O(log n) time, and for a lookup like SELECT * FROM users WHERE id = 123, the index locates the target row in microseconds.

For analytical queries that touch millions of rows, row selection is not the bottleneck. Finding the rows is fast. Reading the data from those rows is slow. An index scan on a million-row result set still reads the full heap tuple for every matching row to extract the needed columns.

The one exception is a covering index, which stores column values inside the index itself so Postgres can satisfy the query without touching the heap. But covering indexes for analytical queries become impractical at scale. When queries involve aggregations across high-frequency writes, wide covering indexes impose substantial write overhead, compounding exactly the index maintenance costs described in the optimization treadmill post.

B-tree indexes optimize for row selection (which rows to read). Analytical query performance is dominated by row width (how much data per row). These are different problems, and solving one leaves the other intact. For a broader look at what this means for your schema design, see Best Practices for PostgreSQL Data Analysis.

How Columnar Storage Changes the Equation

In columnar storage, data is organized by column instead of by row. All values for ts live together in one stream on disk. All values for temperature live together in another. When the query needs those two columns, it reads two streams. The other 13 columns are never touched.

Same query, same 100 million rows: data read drops to 100M x 12 bytes = 1.14 GB. With typical 10 to 20x compression for time-series data, that compresses to approximately 60 to 120 MB. At 500 MB/sec, the same scan completes in roughly 0.12 to 0.24 seconds.

The compression benefit stacks on top of the I/O reduction. Because all values in a column share the same data type, compression algorithms work far more effectively. Sequential timestamps delta-encode to near-zero storage overhead. Floating-point sensor values compress with XOR-based techniques derived from Facebook's Gorilla algorithm. Row-oriented heap storage can't apply any of these because values from different columns are interleaved on every page. There's no contiguous column stream to compress.

Hypercore: Row and Columnar in One Table

The tradeoff with pure columnar storage is write performance. Every new row appends to each column file separately, which adds overhead for high-frequency ingestion. You get the read benefit but give up write throughput. Tiger Data's Hypercore solves this with a hybrid layout that keeps both.

Recent data stays in row-oriented storage for fast ingestion. Older data converts automatically to columnar format based on a compression policy you configure. The application writes standard SQL to one table. The storage format changes by age without any application-layer involvement.

-- Enable Hypercore on a hypertable with a 7-day row storage window
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'ts DESC'
);

SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');

New rows land in row format and ingest quickly. Data older than seven days converts to columnar chunks. To verify the behavior immediately without waiting for the policy schedule, compress a chunk manually:

SELECT compress_chunk(c) FROM show_chunks('sensor_readings') c LIMIT 1;

Then run EXPLAIN (ANALYZE, BUFFERS) on the aggregation query to see the difference in buffer reads (representative output on a 100M-row dataset):

-- Before: row storage sequential scan
Seq Scan on sensor_readings
  Buffers: shared read=2375000 -- 18.6 GB read from disk
  Execution Time: 38142.2 ms

-- After: Hypercore columnar scan
Custom Scan (ColumnarScan) on sensor_readings
  Buffers: shared read=10240 -- 80 MB read from disk
  Execution Time: 196.4 ms

The same SELECT statement works against both storage formats. The query planner handles the difference transparently.

Conclusion

Row storage reads every column to access any column. For analytical queries that scan millions of rows and need only a few, this is the largest source of I/O overhead. It doesn't yield to index tuning, partitioning, or hardware upgrades.

Calculate the read amplification ratio for your most common analytical queries using the pg_column_size queries above. If the ratio is above 5x, Hypercore is the direct fix. Start a free Tiger Data trial today to enable the hybrid storage model on your tables.

DEV Community