As industries like IoT, finance, and healthcare increasingly generate tremendous amounts of data, scalable and efficient cloud storage solutions have become essential. Developers often debate when to use PostgreSQL vs. specialized solutions due to scalability concerns, but it’s possible to engineer PostgreSQL to handle these challenging workloads efficiently and at a petabyte scale—whether it’s events, time series, real-time analytics, or vector data.
Another common concern is PostgreSQL cloud costs. The trade-off between management convenience and cost control is a real consideration for developers. Cue tiered storage: Timescale's tiered storage architecture offers a practical approach to managing large datasets in PostgreSQL by moving older, less frequently accessed data to S3. This allows you to save on storage costs while keeping all your data queryable from the same database, even the data you rarely access.
In this post, we will explore how to get the most out of this architecture—balancing the high-performance tier for recent, frequently accessed data with the low-cost tier for older, less critical data, while weighing any trade-offs so you can scale your PostgreSQL database with confidence.
Reducing PostgreSQL Cloud Costs: Tiered Storage Architecture
Timescale's tiered storage system is built to handle the explosive growth of time-series data and other demanding workloads, such as vector data or real-time analytics. Its architecture splits data into high-performance storage (frequently accessed) and low-cost storage (infrequently accessed) tiers, allowing you to optimize both PostgreSQL costs and performance.
1. High-performance tier
This tier stores your recent, frequently accessed data on high-performance disks. It ensures low-latency, high-throughput reads and writes, making it ideal for workloads that require real-time access. Recent and frequently accessed data remains performant due to PostgreSQL’s optimized indexing, caching, and Timescale’s time-partitioned hypertables.
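For reference, here is a minimal sketch of how a time-partitioned hypertable like the conditions table used in the examples below could be created (the table and column names, including device_id, are illustrative):
CREATE TABLE conditions (
  time_column TIMESTAMPTZ NOT NULL,
  device_id   TEXT,
  temperature DOUBLE PRECISION,
  humidity    DOUBLE PRECISION
);
-- Turn the table into a time-partitioned hypertable, chunked by time_column
SELECT create_hypertable('conditions', 'time_column');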
2. Low-cost tier
The low-cost tier is designed for older, less frequently accessed data. It uses object storage like AWS S3 to significantly reduce storage costs while the data remains an integral, queryable part of your database. It comes with trade-offs, though: while old data remains fully queryable, you may experience slower response times compared to the high-performance tier. This is one aspect we have recently improved, optimizing query performance for tiered data by 400x so you can run complex queries across both tiers.
How Tiered Storage Helps You Scale PostgreSQL
Scalability
The tiered storage system offers virtually unlimited data storage, enabling you to handle databases that grow into petabytes without skyrocketing costs. As your time-series data grows, older data can be automatically moved to the low-cost tier without requiring database restructuring or external ETL (extract, transform, load) processes.
Cost efficiency
Storing data in S3-like object stores is significantly cheaper than keeping all data on high-performance disks. For instance, Timescale's low-cost storage layer offers a flat price of $0.021 per GB/month, cheaper than Amazon S3’s typical rate. When your dataset grows, this cost-saving benefit compounds.
Note that, unlike the high-performance tier, the low-cost tier is priced per uncompressed gigabyte.
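As an illustrative calculation at the quoted rate, tiering roughly 10 TB of uncompressed data would cost about 10,240 GB × $0.021 ≈ $215 per month in the low-cost tier.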
Transparent querying
Regardless of where your data is stored, Timescale’s architecture makes querying seamless. SQL queries are designed to access both tiers without needing special tools or database management changes. Timescale's query planner automatically fetches data from the relevant tier based on your time-range query, maintaining a unified interface.
Note that to make querying tiered data seamless, Timescale provides a GUC (grand unified configuration setting), timescaledb.enable_tiered_reads, which needs to be set to true. It is false by default, which means that out of the box, Timescale does not touch tiered data. More in the docs: Querying tiered data.
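For example, you can enable tiered reads for the current session or for every new connection, using standard PostgreSQL GUC mechanics (tsdb below stands in for your database name):
-- Enable tiered reads for the current session only
SET timescaledb.enable_tiered_reads = true;
-- Or enable them for all new connections to the database
ALTER DATABASE tsdb SET timescaledb.enable_tiered_reads = true;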
Performance Trade-Offs
Understanding the trade-offs between high-performance storage and object storage like S3 is essential for managing performance in a tiered storage environment.
Query speed
PostgreSQL is optimized for real-time queries with sub-millisecond response times. Due to the nature of object storage, query latency increases for old data stored in S3. However, this trade-off is acceptable for many analytical workloads or infrequently accessed data. Timescale ensures that queries spanning both tiers are handled efficiently, but the speed of the high-performance tier cannot be matched by S3.
A particularly useful GUC provided by Timescale is timescaledb.enable_tiered_reads. When it is set to false, queries fetch data only from the high-performance tier and ignore data in the low-cost storage entirely, as if it didn't exist. If your use case can tolerate ignoring data in the low-cost (and lower-performance) tier, setting this GUC to false is one way of guaranteeing query speed.
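Here is a minimal sketch of scoping that setting, assuming the conditions hypertable from the earlier examples:
-- Ignore tiered data for this session; queries touch only the high-performance tier
SET timescaledb.enable_tiered_reads = false;
-- Or limit the setting to a single transaction
BEGIN;
SET LOCAL timescaledb.enable_tiered_reads = false;
SELECT count(*) FROM conditions WHERE time_column > now() - INTERVAL '7 days';
COMMIT;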
Storage costs
The cost difference between high-performance storage and low-cost S3 storage is significant. As a result, storing large datasets that may not be frequently accessed in the high-performance tier is costly. Tiering older data to S3 can reduce this burden, lowering your overall operational costs.
Immutability vs. flexibility
PostgreSQL allows real-time inserts, updates, and deletes, which is critical for most OLTP and real-time analytics applications. Timescale’s high-performance tier also allows these operations over compressed data. S3’s immutable nature, however, limits flexibility. Data tiered to the low-cost S3 is best for archival or append-only datasets where modifications are not expected.
How to Optimize Queries Across Both Tiers
When querying data that spans both tiers, TimescaleDB uses its query planner to seamlessly fetch data from both tiers. However, query performance can degrade if queries are not optimized. Here's how you can improve this:
Use LIMIT and ORDER BY for partial results
Queries that span both tiers can be optimized by retrieving partial results instead of scanning the entire dataset. Using LIMIT along with ORDER BY reduces overhead by fetching only the required number of rows.
Example
SELECT * FROM conditions
WHERE time_column BETWEEN '2024-01-01' AND '2024-06-01'
ORDER BY time_column DESC
LIMIT 100;
Explanation
- This query fetches the latest 100 rows from the specified time range. By ordering the results and limiting the row count, it retrieves only the most recent data (typically from high-performance storage) and touches the low-cost tier only when necessary.
Chunk exclusion techniques to prevent unnecessary scans
Chunk exclusion is one of Timescale’s most powerful features for improving query performance. It ensures that the database only scans the necessary chunks of data and avoids querying irrelevant chunks, particularly in the low-cost tier.
Timescale automatically partitions data into chunks based on time intervals. Each chunk stores a specific time range, so by filtering on time_column in a WHERE clause, you help Timescale exclude chunks that don’t match the query, avoiding unnecessary data scans.
Using a WHERE clause for chunk exclusion
Let's say we want to fetch temperature data only for January 2024. Without chunk exclusion, the database might scan all data (including irrelevant historical data) if it’s not optimized properly.
Optimized query using WHERE for time-based chunk exclusion
SELECT
time_column, temperature
FROM
conditions
WHERE
time_column BETWEEN '2024-01-01' AND '2024-01-31';
Explanation
- Timescale’s query planner knows the time range for each chunk of data and will only scan the chunk(s) that contain data for January 2024.
- Chunks outside this range, regardless of tier, will be automatically excluded, reducing query latency.
Combining time_bucket and WHERE clause
You can combine the time_bucket function with a specific time range to both aggregate data and enable chunk exclusion.
SELECT
time_bucket('1 hour', time_column) AS bucket,
avg(temperature) AS avg_temp
FROM
conditions
WHERE
time_column BETWEEN '2024-03-01' AND '2024-03-31'
GROUP BY
bucket;
Explanation
- This query aggregates data into one-hour intervals for the month of March 2024.
- Timescale’s chunk exclusion mechanism ensures that only the relevant chunks containing March 2024 data are scanned, avoiding any unnecessary chunks from both the high-performance and low-cost tiers.
Handling Immutability in Low-Cost Storage: Strategies and Workarounds
One of the key trade-offs of using S3 as low-cost storage in Timescale is immutability—once data is moved to low-cost storage, it cannot be modified directly. While this immutability ensures data stability and durability, it can become a challenge when you need to update historical data for backfilling or corrections.
This section will discuss strategies for handling immutability, including chunk rehydration, backfilling historical data, and using staging updates in the high-performance tier.
Backfilling historical data via chunk rehydration
When historical data needs to be updated (e.g., for correcting errors or adding missing data), Timescale provides a mechanism called chunk rehydration. This process allows you to temporarily bring old data back into the high-performance tier, where it can be modified, and then retier it by moving it back to low-cost storage.
Workflow for backfilling with chunk rehydration
- Identify the chunk: first, identify the chunk that contains the historical data you need to modify. You can run a query like the following to get the correct chunk name:
SELECT chunk_name FROM timescaledb_information.chunks WHERE hypertable_name = 'conditions';
- Rehydrate the chunk: use the untier_chunk function to bring the compressed old data from S3 back into high-performance storage.
- Modify the data: once the chunk is rehydrated, you can perform any necessary updates or backfill the missing data.
- Recompress and retier the chunk: after performing the updates, the data is compressed again and moved back to low-cost storage.
Example of rehydrating and backfilling a chunk
-- Bring the chunk back from low-cost storage to high-performance storage for modification
CALL untier_chunk('_timescaledb_internal._hyper_1_1_chunk');
-- Perform your data updates or inserts
UPDATE conditions
SET temperature = 22.5
WHERE time_column BETWEEN '2024-01-01' AND '2024-01-15';
-- After modifications, the chunk will be recompressed and sent back to low-cost storage
CALL retier_chunk('_timescaledb_internal._hyper_1_1_chunk');
Explanation
- The untier_chunk function temporarily moves the chunk back to the high-performance tier for updates.
- You can then modify the data as needed using regular SQL operations.
- Once the modifications are complete, the retier_chunk function recompresses the chunk and moves it back to low-cost storage.
When to use chunk rehydration
- Backfilling missing data: when historical data needs to be updated to fill in missing records (e.g., due to sensor downtime in IoT applications).
- Correcting historical errors: fixing errors in older data that was previously archived to low-cost storage.
- Compliance updates: modifying data to meet regulatory requirements, especially when historical records need corrections.
Staging updates in the high-performance tier
If you need to perform more frequent updates on old data or want to avoid repeated rehydration of chunks, another effective strategy is to stage updates in the high-performance tier before moving the data to low-cost storage. This approach is particularly useful when backfilling or modifying a large amount of historical data.
Workflow for staging updates
- Move data temporarily to the high-performance tier: when you anticipate updates, keep the data in the high-performance tier for a longer duration. This makes it easier to modify without moving it back and forth.
- Perform updates or backfilling: execute the updates or backfill operations while the data remains in high-performance storage.
- Recompress and tier the data: once all necessary updates have been made, recompress and move the data to the low-cost tier.
Example of staging data for updates
-- Extend the tiering interval so data stays in high-performance storage longer and remains modifiable
SELECT add_tiering_policy('conditions', INTERVAL '12 months');
-- Perform any updates or backfilling on recent data
UPDATE conditions
SET temperature = temperature + 1.0
WHERE time_column BETWEEN '2024-01-01' AND '2024-01-31';
-- Once updates are complete, shorten the tiering interval so the data moves back to low-cost storage
SELECT add_tiering_policy('conditions', INTERVAL '6 months');
Explanation
- By extending the tiering interval, data remains in the high-performance tier (and therefore stays modifiable) for a longer period.
- After performing the required updates, you can shorten the interval again so the data moves back to low-cost storage.
When to use staging in high-performance tier
- Frequent backfills: if you know that a specific dataset might need frequent updates, keeping it in the high-performance tier can avoid the overhead of constant rehydration and retiering.
- Time-dependent updates: some updates may need to happen at regular intervals (e.g., quarterly data adjustments). Staging the data in the high-performance tier before moving it to low-cost storage ensures all updates are completed before finalizing.
Workarounds for immutability conflicts
In cases where immutability conflicts with data maintenance needs, there are a few additional strategies you can apply to balance the benefits of low-cost storage with the ability to modify data.
Using high-performance-only tables for staging and merging
Instead of moving data back and forth between high-performance and low-cost tiers, you can create temporary high-performance-only tables to stage data updates before merging them with historical data.
Workflow for staging and merging:
- Create a temporary staging table: insert new or updated records into a separate table that resides entirely in the high-performance tier.
- Merge with historical data: once the staging table has all the updates, merge the new data into the main hypertable, rehydrating chunks if necessary.
Example
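A minimal sketch of the staging-and-merge pattern, assuming a hypothetical staging table conditions_staging that mirrors the conditions hypertable (adapt the column list and chunk name to your schema):
-- Stage corrections in a regular table that lives entirely in the high-performance tier
CREATE TABLE conditions_staging (LIKE conditions INCLUDING DEFAULTS);
INSERT INTO conditions_staging (time_column, temperature, humidity)
VALUES ('2024-01-15', 23.1, 44.8);
-- Rehydrate the affected chunk if it has already been tiered
CALL untier_chunk('_timescaledb_internal._hyper_1_1_chunk');
-- Merge the staged rows into the main hypertable
INSERT INTO conditions (time_column, temperature, humidity)
SELECT time_column, temperature, humidity FROM conditions_staging;
DROP TABLE conditions_staging;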
Benefits
- Separation of concerns: the staging table allows you to manage updates independently of the low-cost storage, ensuring immutability is maintained until necessary.
- Minimizes chunk rehydration: by collecting all updates in a high-performance-tier staging table first, you minimize the need for constant chunk rehydration.
Appending new data instead of updating
If modifications are minimal and mostly involve adding new information rather than altering existing records, consider designing your data model to favor append-only operations. This aligns well with the immutability of low-cost storage, as you won’t need to modify the existing data.
Example
-- Append new temperature readings as new rows rather than modifying existing ones
INSERT INTO conditions (time_column, temperature, humidity)
VALUES ('2024-01-15', 22.5, 45.2);
When to use
- Audit trails: in situations where keeping a historical record of all changes is important (e.g., audit trails), appending new data instead of updating is a natural fit.
- Avoiding data mutability: by not altering existing records, you ensure the data in low-cost storage remains immutable and untouched.
Best Practices for Using Tiered Storage
1. Set appropriate data retention policies
One of the most important aspects of leveraging tiered storage is setting well-thought-out data retention and tiering policies. Timescale allows you to move older data to low-cost storage after a specified period using simple SQL commands. For instance, you can automate the movement of all data older than six months to low-cost storage:
SELECT add_tiering_policy('your_hypertable', INTERVAL '6 months');
This policy automatically shifts the data after six months, ensuring the high-performance tier remains optimized for recent queries. Implementing intelligent policies helps prevent overburdening the high-performance tier and keeps performance high.
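If you later need to adjust the interval, the usual pattern (assuming a remove_tiering_policy counterpart to add_tiering_policy, mirroring Timescale's other add_/remove_ policy functions) is to drop the existing policy and add a new one:
-- Replace the existing tiering policy with a longer interval
SELECT remove_tiering_policy('your_hypertable');
SELECT add_tiering_policy('your_hypertable', INTERVAL '9 months');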
2. Monitor query patterns and optimize tier shifting
Not all older data may need to move immediately to low-cost storage. For workloads that intermittently access historical data for compliance or analysis, you may want to fine-tune tiering policies based on query patterns. Understanding which data is critical for real-time analysis helps you optimize storage costs while maintaining efficient performance.
For example, financial institutions often need to access last year’s data for audits or compliance, but the data doesn't need to reside in the high-performance tier at all times. Query monitoring tools within Timescale can give you insights into data access frequency, which helps you make informed decisions on when to move data to the low-cost tier.
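One way to ground these decisions is to look at the time range each chunk covers before picking a tiering interval, for example:
-- List chunks of the hypertable and the time ranges they cover
SELECT chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'your_hypertable'
ORDER BY range_start;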
3. Compression for high-performance tier data
Compression can substantially reduce the storage footprint of high-performance data. Timescale’s native hybrid row-columnar storage engine reduces disk space usage by 90% or more, which lowers storage costs even for the high-performance tier. Data in the high-performance tier can be kept compressed for longer while still ensuring fast query response times, allowing more efficient use of the more expensive PostgreSQL storage. This is especially useful for time-series data, which often exhibits high redundancy.
Example:
ALTER TABLE your_hypertable SET (timescaledb.compress = true);
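To take this further, you can tell the columnar engine how to organize compressed data and add a policy that compresses chunks automatically (the device_id segmentby column and the 7-day interval below are illustrative):
-- Segment and order compressed data for faster filtering
ALTER TABLE your_hypertable SET (
  timescaledb.compress = true,
  timescaledb.compress_segmentby = 'device_id',
  timescaledb.compress_orderby = 'time_column DESC'
);
-- Automatically compress chunks once they are older than 7 days
SELECT add_compression_policy('your_hypertable', INTERVAL '7 days');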
4. Plan for the S3 trade-offs
While Timescale enables transparent access to data in the low-cost tier, it’s important to consider S3's characteristics:
Immutability
Once data is tiered to low-cost storage, it cannot be modified. This is a fundamental trade-off of using object stores like S3. Plan accordingly and ensure that data you no longer expect to modify goes into low-cost storage.
Higher latency
Accessing old data from S3 involves additional latency compared to querying high-performance tier’s recent data. Timescale mitigates this with efficient query planning, but it's worth noting that certain queries, especially those that span both tiers, may see increased response times. Design your applications to tolerate slightly higher latency for historical queries.
5. Untier chunks for backfilling or updates
If you need to update historical data that resides in low-cost storage, Timescale provides a mechanism to untier specific data chunks, allowing you to bring them back to high-performance storage temporarily. This can be useful for cases where backfilling data is necessary.
Example:
CALL untier_chunk('_timescaledb_internal._hyper_1_1_chunk');
6. Optimize queries for tiered data
When querying across both recent and old data, ensure your queries are optimized to avoid unnecessary scans across the entire dataset. Timescale’s chunk exclusion feature automatically excludes unnecessary data chunks from being queried. Additionally, using time_bucket and other SQL functions effectively can further optimize performance.
7. Leverage data replication smartly
Timescale’s replication mechanisms ensure that data, whether in the high-performance tier or the low-cost tier, is stored securely across multiple nodes. If you’re using read replicas or forks for testing, ensure that you are using the low-cost tier efficiently across all instances. Because Timescale only charges once for data stored in the low-cost tier, replicated data remains cost-efficient.
Conclusion
Timescale’s tiered storage unlocks both the performance of PostgreSQL and the cost savings of cloud-based object storage like S3, making it an ideal solution for time-series data and challenging workloads at scale. By effectively managing tiering policies, leveraging compression, and understanding the trade-offs between high-performance and low-cost data storage, you can optimize your Timescale deployment to handle growing datasets with ease while keeping cloud costs in check.
Incorporating these best practices will help you strike the right balance between performance and cost, ensuring that your time-series workloads remain efficient and scalable for the long term. For further reading, make sure to read this article on Scaling PostgreSQL to Petabyte Scale using Timescale’s tiered storage architecture.
Tiered storage is available only in Timescale’s PostgreSQL cloud platform, Timescale Cloud. Start for free here (no credit card required, free for 30 days).