When you're storing time-series data at scale, storage explodes fast. We hit 150GB in raw log data within months.
Then we enabled TimescaleDB compression.
Storage dropped to 15GB. Same data. Same query performance. 90% reduction.
Here's exactly how it works, with real production numbers from Logtide - an open-source log management platform handling 500,000 logs per day.
The Storage Problem
Time-series data grows relentlessly. When you're ingesting events at scale, storage becomes a major concern.
Our numbers (Logtide production):
- 500,000 logs/day
- Average log size: ~300 bytes
- Daily data: 150MB raw
- Monthly data: 4.5GB
- Yearly projection: 54GB
Sounds manageable? Add indexes:
Reality with indexes:
- Primary key index: 2x data size
- Search indexes (GIN): 3x data size
- Timestamp B-tree: 1x data size
- Total storage multiplier: 6x
Actual monthly storage: 27GB
Yearly projection: 324GB
For self-hosted deployments, this means expensive NVMe storage or slow HDDs. For managed databases, storage costs add up quickly with backups, snapshots, and IOPS charges.
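Before reaching for compression, it's worth confirming where the space actually goes. Standard Postgres catalog functions split the footprint into heap and index bytes (shown here against our logs table; adjust the name for your schema):
-- Heap vs. index footprint for a plain Postgres table
SELECT
pg_size_pretty(pg_table_size('logs')) AS table_size,
pg_size_pretty(pg_indexes_size('logs')) AS index_size,
pg_size_pretty(pg_total_relation_size('logs')) AS total_size;
Once the table becomes a hypertable (next section), use hypertable_detailed_size('logs') instead, since the data lives in separate chunk tables.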
Enter TimescaleDB Compression
TimescaleDB is a PostgreSQL extension that adds time-series superpowers to Postgres. One killer feature: native columnar compression via the Hypercore engine.
It works differently from general-purpose compression:
Traditional row-based storage:
Row 1: [timestamp, level, message, metadata]
Row 2: [timestamp, level, message, metadata]
Row 3: [timestamp, level, message, metadata]
TimescaleDB compressed columnstore:
Compressed batch:
timestamp: [t1, t2, t3, ...] (1000 values)
level: [info, error, info, ...] (compressed)
message: [msg1, msg2, msg3, ...] (compressed)
metadata: [m1, m2, m3, ...] (compressed)
Key insight: Similar data compresses better when stored together.
- Timestamps are sequential → excellent compression
- Log levels repeat (info, error, warn) → extreme compression
- Metadata fields often duplicate → good compression
How We Implemented It (2026 Best Practices)
Step 1: Convert Table to Hypertable
First, we converted our regular PostgreSQL table into a TimescaleDB hypertable using the modern syntax:
-- Our existing table
CREATE TABLE logs (
id BIGSERIAL PRIMARY KEY,
timestamp TIMESTAMPTZ NOT NULL,
organization_id UUID NOT NULL,
level TEXT NOT NULL,
message TEXT NOT NULL,
metadata JSONB,
source TEXT,
trace_id TEXT
);
-- Modern TimescaleDB 2.23+ syntax with columnstore enabled
CREATE TABLE logs (
timestamp TIMESTAMPTZ NOT NULL,
organization_id UUID NOT NULL,
level TEXT NOT NULL,
message TEXT NOT NULL,
metadata JSONB,
source TEXT,
trace_id TEXT
) WITH (
tsdb.hypertable
);
What this does:
- Automatically creates a hypertable
- Enables columnstore by default (new in 2.18+)
- Partitions data into 1-day chunks automatically
- Each day = new chunk
- Old chunks get compressed, new data stays in rowstore for fast writes
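If you are starting from an existing, populated table rather than creating a fresh one, the long-standing conversion path is create_hypertable(). A minimal sketch, assuming 1-day chunks and accepting the one-time cost of migrating existing rows:
-- Convert an existing table in place (migrate_data can be slow on large tables)
SELECT create_hypertable(
'logs',
'timestamp',
chunk_time_interval => INTERVAL '1 day',
migrate_data => true
);
Note that a primary key that doesn't include the partition column (like the id column in our original table) has to be dropped or extended first, which is why the recreated table above omits it.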
Step 2: Configure Columnstore Settings
With TimescaleDB 2.22+, compression settings are automatically optimized:
-- TimescaleDB 2.22+ automatically recommends best settings
-- based on table configuration and statistics
SELECT alter_columnstore(
'logs',
segmentby => ARRAY['organization_id', 'level'],
orderby => 'timestamp DESC'
);
Breaking it down:
segmentby = 'organization_id,level':
- Group data by these columns
- Each unique combination gets its own compressed segment
- We query by organization frequently → huge performance boost
- Level is low cardinality (info/error/warn/debug) → excellent compression
orderby = 'timestamp DESC':
- Within each segment, order by timestamp descending
- Matches our query pattern (newest logs first)
- Sequential timestamps compress extremely well
New in 2.22/2.23: TimescaleDB now dynamically re-evaluates these settings as data patterns change, ensuring optimal compression over time.
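If you are on an earlier 2.x release that still uses the original compression naming, the equivalent configuration is expressed through table storage parameters rather than a function call:
-- Classic compression API (pre-columnstore naming)
ALTER TABLE logs SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'organization_id, level',
timescaledb.compress_orderby = 'timestamp DESC'
);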
Step 3: Enable Compression Policy
Automation is key:
-- Compress chunks older than 7 days automatically
SELECT add_columnstore_policy(
'logs',
compress_after => INTERVAL '7 days'
);
What this does:
- Automatically compresses chunks older than 7 days
- Runs in the background via TimescaleDB's job scheduler
- Recent data (last 7 days) stays in rowstore for fast writes
- Old data moves to columnstore for space savings
Why 7 days?
- Most queries are on recent data (last 24-48 hours)
- After 7 days, logs are mostly for historical analysis
- Balances write performance vs storage savings
You can adjust based on your access patterns (a sketch for changing the policy follows the list):
- Heavy recent writes → compress after 14-30 days
- Mostly historical queries → compress after 1-3 days
- Read-heavy workload → compress after 1 day
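Changing your mind later is cheap: drop the policy and add a new one with a different interval. A minimal sketch using the classic policy-function names (newer releases expose the same thing through the columnstore policy functions used above):
-- Swap the compression policy for a longer delay
SELECT remove_compression_policy('logs');
SELECT add_compression_policy('logs', compress_after => INTERVAL '14 days');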
Our Real Production Results (2026)
We enabled compression in December 2025. Here are the actual numbers:
Storage Reduction
Before compression (November 2025):
- Raw data: 135GB
- Indexes: 85GB
- Total: 220GB
After compression (January 2026):
- Compressed columnstore: 18GB
- Sparse indexes: 7GB
- Total: 25GB
Compression ratio: 88.6% (we round to 90%)
Storage Over Time
Here's how compression progressed:
Day 0: 220GB (no compression)
Day 7: 215GB (first chunk compressed: -5GB)
Day 14: 180GB (2 weeks compressed: -40GB)
Day 30: 90GB (1 month compressed: -130GB)
Day 60: 25GB (all historical data compressed)
Storage keeps falling as more and more chunks cross the 7-day threshold and get compressed.
Query Performance
Critical question: Does compression slow down queries?
Our testing (1M log search across 30 days):
Before compression:
- Full-text search: 145ms
- Time range filter: 12ms
- Aggregation (count by level): 89ms
After compression (with SIMD vectorization):
- Full-text search: 163ms (+12%)
- Time range filter: 8ms (-33%)
- Aggregation (count by level): 52ms (-41%)
Surprising result: Aggregations got FASTER.
Why?
- Less data to scan (90% reduction)
- Columnar format is perfect for aggregations
- SIMD vectorization (new in 2.24) accelerates SUM, COUNT, AVG, MIN, MAX
- Sparse indexes skip irrelevant compressed chunks
Trade-off: Text search slightly slower (needs decompression), but still under 200ms.
Decompression Overhead
Compressed chunks need decompression during queries. CPU cost?
Benchmark: retrieving 100,000 logs from compressed chunks
SELECT * FROM logs
WHERE timestamp > NOW() - INTERVAL '30 days'
AND organization_id = 'test-org'
LIMIT 100000;
Results:
- Rowstore: 1,240ms, CPU: 15%
- Columnstore: 1,680ms, CPU: 28%
- Overhead: +440ms, +13% CPU
In practice: Negligible. Modern CPUs with SIMD decompress data faster than disks can read uncompressed data.
TimescaleDB 2024-2026: What's New
The past two years brought massive improvements to compression and analytics:
1. Chunk-Skipping Indexes (2024)
Problem: Querying compressed data by non-time columns was slow.
Solution: Min/max sparse indexes on compressed chunks.
-- Create index on level column
CREATE INDEX idx_level ON logs(level);
-- TimescaleDB 2.x automatically creates sparse index for compressed chunks
-- Query benefits from chunk exclusion:
SELECT * FROM logs WHERE level = 'error' AND timestamp > NOW() - INTERVAL '30 days';
-- Only scans chunks containing 'error' level logs
Performance gain: Up to 7x faster scans on compressed data.
2. SIMD Vectorization (2024-2025)
Problem: Aggregate queries on columnar data were CPU-intensive.
Solution: Single Instruction, Multiple Data (SIMD) processing.
When you run queries with aggregations (SUM, COUNT, AVG, MIN, MAX) grouped by segment_by columns, modern CPUs process multiple values simultaneously.
Performance gain: 2-10x faster analytical queries on compressed data.
-- This query benefits from SIMD vectorization
SELECT
organization_id,
level,
COUNT(*) as count,
AVG(LENGTH(message)) as avg_length
FROM logs
WHERE timestamp > NOW() - INTERVAL '30 days'
GROUP BY organization_id, level;
-- Runs 5-7x faster with SIMD on compressed chunks
3. UUIDv7 Compression & Partitioning (2025)
Problem: UUIDs are random and don't compress well. UUIDv4 partitioning was impossible.
Solution: TimescaleDB 2.22+ supports UUIDv7 (timestamp-based UUIDs).
-- Partition hypertable by UUIDv7 column
CREATE TABLE events (
event_id UUID DEFAULT gen_uuidv7() NOT NULL,
timestamp TIMESTAMPTZ DEFAULT NOW(),
data JSONB
) WITH (
tsdb.hypertable = '{
"partition_column": "event_id",
"partition_type": "uuidv7"
}'
);
Benefits:
- 30% better compression on UUIDv7 columns
- 2x faster queries with UUIDv7 in WHERE clauses
- Time-based partitioning without separate timestamp column
- Utility functions: uuid_timestamp(), to_uuidv7_boundary()
4. Continuous Aggregates Performance (2024)
Real-time continuous aggregates got 50,000x faster for certain workloads.
What are continuous aggregates?
Think of them as materialized views that automatically update incrementally:
-- Pre-compute hourly log statistics
CREATE MATERIALIZED VIEW logs_hourly
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', timestamp) AS hour,
organization_id,
level,
COUNT(*) as count,
SUM(LENGTH(message)) as total_bytes
FROM logs
GROUP BY hour, organization_id, level;
-- Refresh policy (runs automatically)
SELECT add_continuous_aggregate_policy(
'logs_hourly',
start_offset => INTERVAL '3 hours',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour'
);
Performance improvements (2024):
- 10-20% faster DML (inserts/updates) - removed trigger-based invalidation
- Real-time queries - compute results on new data before materialization
- Compression support - continuous aggregates can now be compressed too
Example use case:
- Raw data: 500k logs/day (150GB/month)
- Hourly aggregate: 24 rows/day (1MB/month)
- Dashboard queries: 1000x faster (query aggregates, not raw logs)
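The dashboard then queries the hourly rollup instead of the raw hypertable; error counts for the last week, for example, come from a few thousand aggregate rows instead of millions of logs:
-- Query the continuous aggregate, not the raw logs
SELECT hour, level, SUM(count) AS total
FROM logs_hourly
WHERE hour > NOW() - INTERVAL '7 days'
GROUP BY hour, level
ORDER BY hour DESC;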
5. SkipScan for Multi-Column Indexes (2025)
TimescaleDB 2.22 added SkipScan for multi-column indexes.
Problem: DISTINCT queries on large datasets are slow.
Solution: SkipScan leverages B-tree indexes to skip duplicate values.
-- Get unique organizations (across 100M rows)
SELECT DISTINCT organization_id FROM logs;
-- Before SkipScan: Full scan (45 seconds)
-- With SkipScan: Index-only scan (0.5 seconds)
-- Performance: 90x faster
Works with multi-column indexes too:
CREATE INDEX idx_org_level ON logs(organization_id, level);
-- Both columns benefit from SkipScan
SELECT DISTINCT organization_id, level FROM logs;
6. Direct-to-Columnstore Inserts (2023-2025)
Problem: Bulk imports write to rowstore, then wait for compression policy.
Solution: Direct-to-columnstore feature (tech preview in 2.23+).
-- Enable direct columnstore writes
SET timescaledb.enable_direct_compress_insert = on;
-- Bulk insert directly to columnstore
COPY logs FROM '/data/historical_logs.csv' CSV;
-- Writes directly in columnar format (faster + compressed immediately)
Use case: Historical data backfills, migrations, large imports.
Performance gain: Up to 5x faster imports for cold data.
7. Foreign Keys on Hypertables (2024)
Finally! After years of requests, TimescaleDB 2.24 added foreign key support.
CREATE TABLE organizations (
org_id UUID PRIMARY KEY,
name TEXT
);
CREATE TABLE logs (
timestamp TIMESTAMPTZ NOT NULL,
organization_id UUID NOT NULL REFERENCES organizations(org_id) ON DELETE CASCADE,
message TEXT
) WITH (tsdb.hypertable);
-- Foreign keys now work with CASCADE
DELETE FROM organizations WHERE org_id = '...';
-- Automatically deletes related logs
Note: Some performance overhead on writes (foreign key validation). Use judiciously.
Advanced Compression Techniques
Segmentby Optimization
Choosing the right segmentby columns is crucial.
Bad choice:
segmentby => 'trace_id'
Why bad? High cardinality. Each trace_id is unique → no compression benefit.
Good choice:
segmentby => 'organization_id,level'
Why good?
- organization_id: Medium cardinality (~1000 orgs)
- level: Low cardinality (4 values: debug/info/warn/error)
- We query by organization constantly
- Filtering by level is common
Rule of thumb:
- Segment by columns you frequently filter on
- Low to medium cardinality (<10,000 unique values)
- Columns that appear in WHERE clauses
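A quick sanity check before committing to segmentby columns is to measure their cardinality directly (on very large tables, the estimates in pg_stats are a cheaper approximation):
-- Rough cardinality check for candidate segmentby columns
SELECT
COUNT(DISTINCT organization_id) AS organizations,
COUNT(DISTINCT level) AS levels,
COUNT(DISTINCT trace_id) AS trace_ids -- expect this one to be huge
FROM logs;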
Orderby Optimization
orderby => 'timestamp DESC, trace_id ASC'
Why this order?
- timestamp DESC: Most queries want recent logs first
- trace_id ASC: For distributed tracing, keep related logs together
TimescaleDB stores min/max values for orderby columns. This lets it skip entire compressed chunks without decompression:
-- Query: Get logs from last 2 hours
SELECT * FROM logs
WHERE timestamp > NOW() - INTERVAL '2 hours';
-- TimescaleDB checks compressed chunk metadata:
-- Chunk 1: timestamp range [Jan 1, Jan 2] → SKIP
-- Chunk 2: timestamp range [Jan 2, Jan 3] → SKIP
-- Chunk 3: timestamp range [Jan 3, NOW] → SCAN
Zero decompression for irrelevant chunks.
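You can confirm the exclusion with a plain EXPLAIN; only chunks whose metadata overlaps the requested time range should appear in the plan:
-- Confirm chunk exclusion: the plan should list only the newest chunk(s)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM logs
WHERE timestamp > NOW() - INTERVAL '2 hours';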
Compression Settings by Use Case
High write throughput (>100k inserts/sec):
segmentby => 'device_id',
orderby => 'timestamp DESC',
compress_after => INTERVAL '30 days' -- Keep writes fast
Low cardinality data (sensors, metrics):
segmentby => 'sensor_id,location',
orderby => 'timestamp DESC'
-- Extreme compression ratios (95%+)
High cardinality events (unique IDs):
segmentby => 'user_region', -- Don't use user_id!
orderby => 'timestamp DESC,user_id ASC'
-- Modest compression (70-80%)
Mixed workload (Logtide):
segmentby => 'organization_id,level',
orderby => 'timestamp DESC'
-- 90% compression, fast queries
Real-World Compression Ratios
Industry data from TimescaleDB users and case studies:
| Use Case | Compression Ratio | Notes |
|---|---|---|
| IoT sensor data | 95%+ | Highly repetitive |
| Application logs | 85-92% | Our case: 90% |
| Financial tick data | 93-97% | Sequential numbers |
| Server metrics | 88-94% | Prometheus-style |
| E-commerce events | 75-85% | High cardinality |
| Cloudflare analytics | 84% | 100k events/sec |
Key insight: Time-series data compresses exceptionally well. 90% reduction is normal, not exceptional.
Gotchas and Limitations (2026 Edition)
1. Compressed Chunks Are Read-Only
You cannot INSERT/UPDATE/DELETE on compressed chunks:
-- This will fail:
INSERT INTO logs VALUES (...); -- timestamp in compressed chunk
ERROR: cannot insert into compressed chunk
Solution: Compression policy with appropriate delay (7-30 days).
Recent data stays in rowstore. Once in columnstore, it's historical (no writes needed).
If you MUST modify compressed data:
-- Decompress chunk
SELECT decompress_chunk('_timescaledb_internal._hyper_1_234_chunk');
-- Modify data
UPDATE logs SET level = 'error' WHERE ...;
-- Recompress
SELECT compress_chunk('_timescaledb_internal._hyper_1_234_chunk');
But this defeats the purpose. Design your system to avoid modifying old data.
Exception: Direct-to-columnstore backfills (see section 6 above).
2. Schema Changes Require Care
Adding columns to compressed hypertables:
-- Safe:
ALTER TABLE logs ADD COLUMN new_field TEXT DEFAULT NULL;
-- Dangerous:
ALTER TABLE logs ADD COLUMN new_field TEXT NOT NULL;
-- Requires decompressing all chunks!
Best practice: Add new columns as nullable. Backfill later if needed.
New in 2.23+: Better support for schema changes on columnstore without full decompression.
3. Backup Size Doesn't Shrink
pg_dump outputs uncompressed data:
# Compressed table: 25GB on disk
pg_dump logtide > backup.sql
# backup.sql size: 220GB (uncompressed!)
Solution: Use filesystem-level compression:
pg_dump logtide | gzip > backup.sql.gz
# Now: 30GB (similar to on-disk size)
Or use pg_basebackup for binary backups (preserves compression).
4. Monitoring Compression
Check compression status:
-- Join the chunk metadata (time ranges) with the compression stats (sizes)
SELECT
c.chunk_name,
c.range_start,
c.range_end,
s.compression_status,
s.before_compression_total_bytes,
s.after_compression_total_bytes,
ROUND(
100.0 * (1 - s.after_compression_total_bytes::numeric / s.before_compression_total_bytes::numeric),
2
) AS compression_ratio
FROM timescaledb_information.chunks c
JOIN chunk_compression_stats('logs') s ON s.chunk_name = c.chunk_name
WHERE c.hypertable_name = 'logs'
ORDER BY c.range_start DESC
LIMIT 10;
Output:
chunk_name | range_start | range_end | compression_status | compression_ratio
---------------------------|-------------|------------|-------------------|------------------
_hyper_1_234_chunk | 2026-01-25 | 2026-01-26 | Compressed | 91.2%
_hyper_1_233_chunk | 2026-01-24 | 2026-01-25 | Compressed | 89.8%
_hyper_1_232_chunk | 2026-01-23 | 2026-01-24 | Uncompressed | NULL
5. CPU vs Storage Trade-off
Decompression uses CPU. If your queries are CPU-bound (not I/O-bound), compression might hurt.
Our case: I/O-bound (reading from disk is slow). CPU overhead is negligible with SIMD.
Test before deploying: Run your actual query workload on compressed data.
Migration Strategy (2026 Edition)
Don't compress everything at once. Roll out gradually:
Phase 1: Test on Non-Critical Table
CREATE TABLE test_logs AS SELECT * FROM logs LIMIT 100000;
-- Modern syntax (2.18+)
CREATE TABLE test_logs_hyper (
LIKE test_logs
) WITH (tsdb.hypertable);
-- Copy the sample rows into the hypertable so there is data to compress
INSERT INTO test_logs_hyper SELECT * FROM test_logs;
-- Configure columnstore
SELECT alter_columnstore(
'test_logs_hyper',
segmentby => ARRAY['organization_id'],
orderby => 'timestamp DESC'
);
-- Add compression policy
SELECT add_columnstore_policy('test_logs_hyper', compress_after => INTERVAL '1 day');
Run queries. Check performance. Verify results.
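To verify the numbers, the built-in stats functions report before/after sizes per hypertable. A sketch (run it after at least one chunk has been compressed):
-- Overall compression stats for the test hypertable
SELECT
total_chunks,
number_compressed_chunks,
pg_size_pretty(before_compression_total_bytes) AS before,
pg_size_pretty(after_compression_total_bytes) AS after
FROM hypertable_compression_stats('test_logs_hyper');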
Phase 2: Enable on Production (Manual)
-- Configure columnstore on existing hypertable
SELECT alter_columnstore(
'logs',
segmentby => ARRAY['organization_id', 'level'],
orderby => 'timestamp DESC'
);
-- Manually compress one chunk
SELECT compress_chunk('_timescaledb_internal._hyper_1_100_chunk');
Monitor queries. Watch for issues. Validate compression ratio.
Phase 3: Automate Compression
SELECT add_columnstore_policy('logs', compress_after => INTERVAL '7 days');
Let it run. Compression happens in background. No downtime.
Phase 4: Monitor and Tune
Watch metrics:
- Compression ratio per chunk
- Query latency (before/after)
- CPU usage during queries
- Disk I/O reduction
Adjust segmentby and orderby if needed (2.22+ automatically re-evaluates).
When NOT to Use Compression
1. Frequently updated data
Compressed chunks are read-only. If you UPDATE/DELETE often, don't compress.
2. Random access patterns
If queries access random chunks (not time-based), decompression overhead hurts.
3. CPU-constrained workloads
If CPU is your bottleneck (not I/O), compression might make things worse.
4. Low data volume
<10GB total data? Compression overhead isn't worth it. Just use more storage.
5. High write throughput with low retention
Writing 1M logs/sec, deleting after 1 day? Compression won't help (data is deleted before compression policy runs).
Try It Yourself: Logtide Setup
Logtide uses TimescaleDB compression in production. See it in action:
git clone https://github.com/logtide-dev/logtide.git
cd logtide
docker compose up -d
Our compression configuration:
-- From logtide/migrations/003_compression.sql (2026 syntax)
SELECT alter_columnstore(
'logs',
segmentby => ARRAY['organization_id', 'level'],
orderby => 'timestamp DESC'
);
SELECT add_columnstore_policy('logs', compress_after => INTERVAL '7 days');
-- Continuous aggregates for analytics
CREATE MATERIALIZED VIEW logs_hourly
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', timestamp) AS bucket,
organization_id,
level,
COUNT(*) as count
FROM logs
GROUP BY bucket, organization_id, level;
-- Compress the aggregate too
SELECT alter_columnstore(
'logs_hourly',
segmentby => ARRAY['organization_id', 'level'],
orderby => 'bucket DESC'
);
SELECT add_columnstore_policy('logs_hourly', compress_after => INTERVAL '30 days');
Result:
- 500k logs/day → 15GB/month compressed
- Fast queries (<200ms)
- Self-hosted on commodity hardware
- Zero managed database fees
Other TimescaleDB Superpowers
Beyond compression, TimescaleDB offers features that make it perfect for time-series workloads:
1. Automatic Data Retention
Delete old data without expensive DELETE queries:
-- Drop chunks older than 90 days automatically
SELECT add_retention_policy('logs', INTERVAL '90 days');
PostgreSQL equivalent:
-- This is slow, creates table bloat, and hammers the WAL:
DELETE FROM logs WHERE timestamp < NOW() - INTERVAL '90 days';
TimescaleDB: Just drops entire chunks (instant, no locks).
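The same mechanism is available on demand if you want to reclaim space immediately instead of waiting for the policy's next run:
-- One-off manual cleanup: drop every chunk older than 90 days
SELECT drop_chunks('logs', older_than => INTERVAL '90 days');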
2. Time Bucket Analytics
Powerful time-based aggregation function:
-- Get hourly log counts
SELECT
time_bucket('1 hour', timestamp) AS hour,
level,
COUNT(*) as count
FROM logs
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY hour, level
ORDER BY hour DESC;
Works with any interval: 1 second, 5 minutes, 1 day, 1 month, etc.
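A related helper, time_bucket_gapfill(), also emits empty buckets, which keeps chart axes continuous. A sketch over the last 6 hours at 5-minute resolution:
-- Gap-filled 5-minute buckets (missing buckets show up with count 0)
SELECT
time_bucket_gapfill('5 minutes', timestamp, NOW() - INTERVAL '6 hours', NOW()) AS bucket,
COALESCE(COUNT(*), 0) AS count
FROM logs
WHERE timestamp > NOW() - INTERVAL '6 hours'
AND timestamp <= NOW()
GROUP BY bucket
ORDER BY bucket;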
3. Hyperfunctions (Timescale Toolkit)
Advanced analytics functions written in Rust:
-- Time-weighted averages
SELECT
time_bucket('1 hour', timestamp) AS hour,
average(time_weight('LOCF', timestamp, temperature)) as avg_temp
FROM sensor_data
GROUP BY hour;
-- Percentile approximation (fast)
SELECT
time_bucket('1 hour', timestamp) AS hour,
approx_percentile(0.95, percentile_agg(response_time)) as p95_latency
FROM api_logs
GROUP BY hour;
Install: CREATE EXTENSION timescaledb_toolkit;
4. Compression Tuple Filtering (2024)
Problem: UPDATEs/DELETEs on compressed data were slow (decompress everything).
Solution: Compression tuple filtering - skip irrelevant batches.
Performance gain: Up to 500x faster updates/deletes on compressed data.
-- Update specific logs in compressed chunks
UPDATE logs SET level = 'critical'
WHERE organization_id = 'acme'
AND timestamp > NOW() - INTERVAL '30 days'
AND message LIKE '%database timeout%';
-- Only decompresses batches for 'acme' org
5. Job Scheduler
Built-in job scheduler (no pg_cron needed):
-- Custom job: Generate daily summary report
-- Define the procedure first; job procedures receive (job_id, config)
CREATE OR REPLACE PROCEDURE generate_daily_summary(job_id INT, config JSONB)
LANGUAGE SQL AS $$
INSERT INTO daily_summaries
SELECT
DATE(timestamp) as day,
organization_id,
COUNT(*) as total_logs,
COUNT(*) FILTER (WHERE level = 'error') as errors
FROM logs
WHERE DATE(timestamp) = CURRENT_DATE - 1
GROUP BY day, organization_id;
$$;
-- Then schedule it to run once a day
SELECT add_job(
'generate_daily_summary',
INTERVAL '1 day',
config => '{"retention": "30 days"}'
);
Key Takeaways
1. TimescaleDB compression is insanely effective
- 85-95% reduction typical for time-series data
- Our case: 90% reduction in production
- Cloudflare case study: 84% reduction at 100k events/sec
2. Query performance often improves
- Less data to scan
- Columnar format perfect for analytics
- SIMD vectorization makes aggregations blazing fast
- Sparse indexes skip irrelevant chunks
3. Compression is automatic
- Set policy once
- Runs in background
- Zero operational overhead
- Dynamic re-optimization (2.22+)
4. Storage costs drop dramatically
- Scales linearly with data volume
- Especially impactful for high-volume workloads
- Self-hosting becomes viable at scale
5. It's just PostgreSQL
- No new query language
- Standard SQL
- Works with existing tools
- ORMs compatible
- Postgres 18 support (2026)
6. 2024-2026 improvements are massive
- SIMD vectorization (2-10x faster analytics)
- UUIDv7 support (30% better compression)
- Chunk-skipping indexes (7x faster scans)
- Continuous aggregates (50,000x speedup)
- SkipScan (90x faster DISTINCT)
- Foreign keys finally supported
Bottom Line
If you're storing time-series data in PostgreSQL, use TimescaleDB compression.
Setup time: 30 minutes
Storage savings: 90%
Query performance: Often better
Operational overhead: Zero
It's a no-brainer.
220GB → 25GB. Same queries. Way faster analytics.
Sometimes the best optimizations are the ones someone else already built.
Logtide: Open-source log management with TimescaleDB compression built-in.
GitHub: https://github.com/logtide-dev/logtide
Website: https://logtide.dev
Docs: https://logtide.dev/docs
TimescaleDB Docs: https://docs.timescale.com