Handling a table with billions of rows in PostgreSQL (or any relational database) can be challenging: the data becomes harder to manage, it consumes significant storage space, and complex or analytical queries slow down.
These challenges can be addressed by enabling columnstore (which compresses data) in Timescale and by using Timescale’s chunk-skipping indexes. Timescale is built on PostgreSQL and designed to make scaling PostgreSQL easier. This post shows how to use Timescale’s columnstore and chunk-skipping index functionality to reduce table size and speed up queries.
Here’s the methodology we’ll follow.
First, we insert data into an uncompressed table to get its baseline size and query speed. Then, we compare these results with a compressed table. Let's dive in.
We will use PostgreSQL on Timescale Cloud—a fully managed database service designed to handle time-series data efficiently. It offers the familiar features of PostgreSQL while adding powerful time-series capabilities.
Features include automatic scaling, high availability, and various performance optimizations, making it easier for developers to store, manage, and query large volumes of time-series data without worrying about infrastructure management.
Here are the instance details that I used for these tests:
- Instance type: Time series
- CPU: 4 cores
- RAM: 16 GB
Benchmarking Uncompressed Table
First, we create a PostgreSQL heap table named sensors_uncompressed in the time-series database and ingest one billion rows into it. After that, we check its statistics, including table size and SELECT query performance.
Step 1: Create a table
CREATE TABLE sensors_uncompressed (
sensor_id INTEGER,
ts TIMESTAMPTZ NOT NULL,
value REAL
);
Step 2: Create an index
CREATE INDEX sensors_ts_idx_uncompressed ON sensors_uncompressed (sensor_id, ts DESC);
Step 3: Ingest data
The dataset was placed on an AWS S3 bucket, so we used the timescaledb-parallel-copy
utility to ingest data inside the table. timescaledb-parallel-copy
is a command line program for parallelizing PostgreSQL's built-in COPY
functionality for bulk-inserting data into TimescaleDB.
curl https://ts-devrel.s3.amazonaws.com/sensors.csv.gz | gunzip | timescaledb-parallel-copy -batch-size 5000 -connection $DATABASE_URI -table sensors_uncompressed -workers 4 -split '\t'
Here are some statistics after successfully ingesting one billion rows into the PostgreSQL heap table.
- Time taken to ingest the data: 49 min 12 sec
- Total table size, including index and data: 101 GB
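If you want to verify these numbers yourself, standard PostgreSQL size functions cover the uncompressed case (a quick sketch; for a plain heap table, pg_total_relation_size includes the data, indexes, and TOAST):
-- Total on-disk size of the table, its indexes, and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('sensors_uncompressed'));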
Step 4: Running aggregate queries
The goal is to compare query execution times by running various scaled aggregate queries on both compressed and uncompressed tables, observing how compressed tables perform in relation to uncompressed ones.
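A quick note on measurement: the timings below were taken on the 4-CPU/16 GB instance described earlier, so your numbers will vary with hardware and caching. One simple way to reproduce them is psql's \timing switch, or EXPLAIN ANALYZE if you also want the query plan. Here is a minimal sketch using the first query as an example:
\timing on
-- psql now prints "Time: ... ms" after every statement.
-- To also see the plan and per-node timings:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sensors_uncompressed
WHERE sensor_id = 0
  AND ts >= '2023-12-21 07:15:00'
  AND ts <= '2023-12-21 07:16:00';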
Query 1
SELECT * FROM sensors_uncompressed
WHERE sensor_id = 0
AND ts >= '2023-12-21 07:15:00'::timestamp
AND ts <= '2023-12-21 07:16:00'::timestamp;
Execution Time: 38 ms
Query 2
SELECT sensor_id, DATE_TRUNC('day', ts) AS day, MAX(value) AS max_value, MIN(value) AS min_value
FROM sensors_uncompressed
WHERE ts >= DATE '2023-12-21' AND ts < DATE '2023-12-22'
GROUP BY sensor_id, DATE_TRUNC('day', ts)
ORDER BY sensor_id, day;
Execution Time: 6 min 31 sec
Query 3
SELECT sensor_id, ts, value
FROM sensors_uncompressed
WHERE ts >= '2023-12-21 07:15:00'
AND ts < '2023-12-21 07:20:00'
ORDER BY value DESC
LIMIT 5;
Execution Time: 6 min 24 sec
Benchmarking Compressed Hypertable
It's now time to gather the same statistics for a compressed hypertable (a PostgreSQL table that automatically partitions data by time) using Timescale's columnstore.
Step 1: Create a table
CREATE TABLE sensors_compressed (
sensor_id INTEGER,
ts TIMESTAMPTZ NOT NULL,
value REAL
);
Step 2: Create an index
CREATE INDEX sensors_ts_idx_compressed ON sensors_compressed (sensor_id, ts DESC);
Step 3: Convert to hypertable
SELECT create_hypertable('sensors_compressed', by_range('ts', INTERVAL '1 hour'));
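Once data has been ingested (Step 6), you can inspect the chunks that back the hypertable. A small sketch using TimescaleDB's built-in helpers:
-- List the chunk tables backing the hypertable
SELECT show_chunks('sensors_compressed');
-- Or view each chunk's time range and compression status
SELECT chunk_name, range_start, range_end, is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'sensors_compressed';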
Step 4: Enable columnstore / compression
ALTER TABLE sensors_compressed SET (timescaledb.compress, timescaledb.compress_segmentby = 'sensor_id');
Step 5: Add compression policy
SELECT add_compression_policy('sensors_compressed', INTERVAL '24 hour');
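If you want to confirm the policy was registered, TimescaleDB lists its background jobs in the timescaledb_information.jobs view (a quick check, not required for the benchmark):
SELECT job_id, application_name, schedule_interval, hypertable_name
FROM timescaledb_information.jobs
WHERE hypertable_name = 'sensors_compressed';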
Step 6: Ingest data
curl https://ts-devrel.s3.amazonaws.com/sensors.csv.gz | gunzip | timescaledb-parallel-copy -batch-size 5000 -connection $CONNECTION_STRING -table sensors_compressed -workers 4 -split '\t'
Here are the statistics after successfully ingesting one billion rows into the hypertable with compression enabled.
- Time taken to ingest the data: 1 hr 3 min 21 sec
- Total table size, including index and data: 5.5 GB
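To reproduce these size numbers on the hypertable, TimescaleDB's size and compression-stats functions are the simplest route (a sketch; newer releases also provide columnstore-named equivalents of the compression-stats function):
-- Total hypertable size, including indexes and compressed chunks
SELECT pg_size_pretty(hypertable_size('sensors_compressed'));
-- Before/after sizes across all compressed chunks
SELECT pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes) AS after
FROM hypertable_compression_stats('sensors_compressed');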
Step 7: Running aggregate queries
Query 1
SELECT * FROM sensors_compressed
WHERE sensor_id = 0
AND ts >= '2023-12-21 07:15:00'::timestamp
AND ts <= '2023-12-21 07:16:00'::timestamp;
Execution Time: 20 ms
Query 2
SELECT sensor_id, DATE_TRUNC('day', ts) AS day, MAX(value) AS max_value, MIN(value) AS min_value
FROM sensors_compressed
WHERE ts >= DATE '2023-12-21' AND ts < DATE '2023-12-22'
GROUP BY sensor_id, DATE_TRUNC('day', ts)
ORDER BY sensor_id, day;
Execution Time: 5 min
Query 3
SELECT sensor_id, ts, value
FROM sensors_compressed
WHERE ts >= '2023-12-21 07:15:00'
AND ts < '2023-12-21 07:20:00'
ORDER BY value DESC
LIMIT 5;
Execution Time: 4.4 sec
Key takeaways
- Storage efficiency: after enabling compression, the table size dropped from 101 GB to 5.5 GB, a reduction of approximately 95%.
- Aggregate query 1 is 47.37% faster on the compressed table.
- Aggregate query 2 is 23% faster on the compressed table.
- Aggregate query 3 is 98.83% faster on the compressed table.
These results demonstrate the significant advantages of using TimescaleDB's compression feature, both in terms of storage savings and improved query performance.
Enhancing Postgres Performance With Chunk-Skipping Indexes
Chunk-Skipping in Timescale
Timescale's chunk-skipping indexes (available as of TimescaleDB 2.16.0) further speed up PostgreSQL performance and reduce the storage footprint. Not every query filters on the partitioning column(s), and for those queries PostgreSQL cannot exclude any partitions because it has no metadata about the non-partitioned columns, which makes them slow. Chunk skipping addresses this by recording min/max metadata for a chosen column in each partition (called a chunk) and using it to dynamically prune chunks during planning or execution.
Chunk-skipping indexes optimize query performance by allowing us to bypass irrelevant chunks when searching through large datasets.
In TimescaleDB, data is organized into time-based chunks, each representing a subset of the overall hypertable. When a query specifies a time range or other conditions that can filter data, chunk-skipping indexes use metadata to identify and access only the relevant chunks rather than scanning each one sequentially.
This targeted access minimizes disk I/O and computational overhead, making queries faster and more efficient, especially in hypertables with billions of rows.
Let's create a table named product_orders
with columns for order details, such as IDs, timestamps, quantity, total, address, and statuses.
CREATE TABLE product_orders (
order_id serial,
order_date timestamptz,
customer_id int,
product_id int,
quantity int,
order_total float,
shipping_address text,
payment_status text,
order_status text
);
Convert to hypertable
Transform the product_orders
table into a TimescaleDB hypertable, partitioned by order_date
with four-day intervals.
SELECT create_hypertable('product_orders', 'order_date', chunk_time_interval=>'4 day'::interval);
Ingest data
To ingest data, we will use a query that generates 50 million rows of dummy order data, simulating one order per minute starting from January 1, 2023. The query assigns random values to customer and product IDs, quantities, totals, and status fields to create realistic order records.
WITH time_series AS (
SELECT generate_series(
'2023-01-01 00:00:00'::timestamptz,
'2023-01-01 00:00:00'::timestamptz + interval '50000000 minutes',
'1 minute'::interval
) AS order_date
)
INSERT INTO product_orders (
order_date, customer_id, product_id, quantity, order_total,
shipping_address, payment_status, order_status
)
SELECT
order_date,
(random() * 1000)::int + 1 AS customer_id,
(random() * 100)::int + 1 AS product_id,
(random() * 10 + 1)::int AS quantity,
(random() * 500 + 10)::float AS order_total,
'123 Example St, Example City' AS shipping_address,
CASE WHEN random() > 0.1 THEN 'Completed' ELSE 'Pending' END AS payment_status,
CASE WHEN random() > 0.2 THEN 'Shipped' ELSE 'Pending' END AS order_status
FROM time_series;
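Before querying, it's worth sanity-checking the ingest (a small sketch; the expected count follows from the generate_series bounds of 50,000,000 minutes at one row per minute, and the chunk count from the four-day interval):
-- Roughly 50 million rows spanning about 95 years of one-minute orders
SELECT count(*) FROM product_orders;
-- Number of four-day chunks created for that time range
SELECT count(*) FROM show_chunks('product_orders');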
Once the data ingestion is complete, let's execute a simple SELECT
statement to measure the time taken for the query to execute.
tsdb=> select * from product_orders where order_id = 50000000;
order_id | order_date | customer_id | product_id | quantity | order_total | shipping_address | payment_status | order_status
----------+------------------------+-------------+------------+----------+-------------------+------------------------------+----------------+--------------
50000000 | 2117-01-24 12:33:00+00 | 515 | 14 | 9 | 61.00540537187403 | 123 Example St, Example City | Completed | Shipped
(1 row)
Time: 42049.154 ms (00:42.049)
Currently, there is no index on the order_id
column, which is why the query took nearly 42 seconds to execute.
Add index
Let's see if we can reduce the 42 seconds by creating a B-tree index on the order_id
column.
CREATE INDEX order_id ON product_orders (order_id);
After creating the index, let's rerun the SELECT
query and check if the execution time is reduced from 42 seconds.
tsdb=> select * from product_orders where order_id = 50000000;
order_id | order_date | customer_id | product_id | quantity | order_total | shipping_address | payment_status | order_status
----------+------------------------+-------------+------------+----------+-------------------+------------------------------+----------------+--------------
50000000 | 2117-01-24 12:33:00+00 | 515 | 14 | 9 | 61.00540537187403 | 123 Example St, Example City | Completed | Shipped
(1 row)
Time: 9684.318 ms (00:09.684)
Great! After creating the index, the execution time dropped to under 10 seconds, which is a significant improvement. Now, let's optimize further and see how chunk skipping can enhance performance even more.
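The remaining cost comes from the fact that order_id is not the partitioning column: without extra metadata the planner cannot exclude any chunks, so it has to probe the per-chunk B-tree index on every chunk. You can see this with EXPLAIN (a sketch; chunk names are generated by TimescaleDB and will differ in your database):
EXPLAIN SELECT * FROM product_orders WHERE order_id = 50000000;
-- The plan appends an index scan for each chunk, because the
-- order_id predicate alone cannot rule any chunk out.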
Enable Chunk-Skipping Index
To take advantage of the chunk-skipping index, we first need to enable chunk skipping on the table and then compress it. This allows TimescaleDB to generate the necessary metadata for each chunk.
ALTER TABLE product_orders SET (timescaledb.compress);
SELECT enable_chunk_skipping('product_orders', 'order_id');
SELECT compress_chunk(show_chunks('product_orders'));
With chunk skipping enabled and the data compressed into the columnstore, let's rerun the same SELECT query to observe the performance improvement.
select * from product_orders where order_id = 50000000;
order_id | order_date | customer_id | product_id | quantity | order_total | shipping_address | payment_status | order_status
----------+------------------------+-------------+------------+----------+-------------------+------------------------------+----------------+--------------
50000000 | 2117-01-24 12:33:00+00 | 515 | 14 | 9 | 61.00540537187403 | 123 Example St, Example City | Completed | Shipped
(1 row)
Time: 304.133 ms
Wow! The query now executes in just 304 ms, a 99.28% improvement over the initial execution time without an index and a 96.86% performance boost compared to the plain B-tree index. That's a significant difference!
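To see the chunk skipping at work, EXPLAIN reports how many chunks the planner excluded (a sketch; the exact plan shape and the "Chunks excluded during startup" line depend on your TimescaleDB version and chunk layout):
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product_orders WHERE order_id = 50000000;
-- Look for a line such as "Chunks excluded during startup: N" in the output.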
| Query Optimization Method | Execution Time | Performance Improvement (vs. No Index) |
| --- | --- | --- |
| No Index | 42,049 ms (≈42 sec) | Baseline |
| With B-tree Index | 9,684 ms (≈9.7 sec) | 77% faster |
| With Chunk-Skipping Index + Columnstore (Compression) | 304 ms (0.3 sec) | 99.28% faster |
In conclusion, using TimescaleDB's key features—like hypertables, columnstore, and chunk-skipping indexes—can greatly improve PostgreSQL performance:
- Hypertables help you manage large amounts of data more easily while keeping everything organized.
- Columnstore reduces storage space and speeds up your queries by cutting the amount of data that needs to be read.
- Chunk-skipping indexes also accelerate query performance by ignoring unnecessary data.
Together, these features make it easier to work with time-series data, events, and real-time analytics. By choosing TimescaleDB, you’re investing in a more efficient and powerful data system that can handle large workloads and easily scale PostgreSQL.
To get started, sign up for a free Timescale Cloud account.