As industries like IoT, finance, and healthcare increasingly generate tremendous amounts of data, scalable and efficient cloud storage solutions have become essential. Developers often debate when to use PostgreSQL vs. specialized solutions due to scalability concerns, but it’s possible to engineer PostgreSQL to handle these challenging workloads efficiently and at a petabyte scale—whether it’s events, time series, real-time analytics, or vector data.
Another common concern is PostgreSQL cloud costs. The trade-off between management convenience and cost control is a real consideration for developers. Cue tiered storage: Timescale's tiered storage architecture offers a practical approach to managing large datasets in PostgreSQL by moving older, less frequently accessed data to S3. This allows you to save on storage costs while keeping all your data queryable from the same database, even the data you rarely access.
In this post, we will explore how to get the most out of this architecture—balancing the high-performance tier for recent, frequently accessed data with the low-cost tier for older, less critical data, while weighing any trade-offs so you can scale your PostgreSQL database with confidence.
Reducing PostgreSQL Cloud Costs: Tiered Storage Architecture
Timescale's tiered storage system is built to handle the explosive growth of time-series data and other demanding workloads, such as vector data or real-time analytics. Its architecture splits data into high-performance storage (frequently accessed) and low-cost storage (infrequently accessed) tiers, allowing you to optimize both PostgreSQL costs and performance.
1. High-performance tier
This tier stores your recent, frequently accessed data on high-performance disks. It ensures low-latency, high-throughput reads and writes, making it ideal for workloads that require real-time access. Recent and frequently accessed data remains performant due to PostgreSQL’s optimized indexing, caching, and Timescale’s time-partitioned hypertables.
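For reference, here is a minimal sketch of how a time-partitioned hypertable like the conditions table used in the examples below could be created (the table and column names, including device_id, are illustrative):
CREATE TABLE conditions (
  time_column TIMESTAMPTZ NOT NULL,
  device_id   TEXT,
  temperature DOUBLE PRECISION,
  humidity    DOUBLE PRECISION
);
-- Turn the table into a time-partitioned hypertable, chunked by time_column
SELECT create_hypertable('conditions', 'time_column');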
2. Low-cost tier
The low-cost tier is designed for older, less frequently accessed data. It uses object storage like AWS S3 to significantly reduce storage costs while the data remains an integral, queryable part of your database. It comes with trade-offs, though: while old data remains fully queryable, you may experience slower response times compared to the high-performance tier. This is one aspect we have recently improved, optimizing query performance for tiered data by 400x so you can run complex queries across both tiers.
How Tiered Storage Helps You Scale PostgreSQL
Scalability
The tiered storage system offers virtually unlimited data storage, enabling you to handle databases that grow into petabytes without skyrocketing costs. As your time-series data grows, older data can be automatically moved to the low-cost tier without requiring database restructuring or external ETL (extract, transform, load) processes.
Cost efficiency
Storing data in S3-like object stores is significantly cheaper than keeping all data on high-performance disks. For instance, Timescale's low-cost storage layer offers a flat price of $0.021 per GB/month, cheaper than Amazon S3’s typical rate. When your dataset grows, this cost-saving benefit compounds.
Note that, unlike the high-performance tier, the low-cost tier is priced per uncompressed gigabyte.
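As an illustrative calculation at the quoted rate, tiering roughly 10 TB of uncompressed data would cost about 10,240 GB × $0.021 ≈ $215 per month in the low-cost tier.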
Transparent querying
Regardless of where your data is stored, Timescale’s architecture makes querying seamless. SQL queries are designed to access both tiers without needing special tools or database management changes. Timescale's query planner automatically fetches data from the relevant tier based on your time-range query, maintaining a unified interface.
Note that to make querying tiered data seamless, Timescale provides a GUC (grand unified configuration setting), timescaledb.enable_tiered_reads, which needs to be set to true. It is false by default, which means that out of the box, Timescale does not touch tiered data. More in the docs: Querying tiered data.
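For example, you can enable tiered reads for the current session or for every new connection, using standard PostgreSQL GUC mechanics (tsdb below stands in for your database name):
-- Enable tiered reads for the current session only
SET timescaledb.enable_tiered_reads = true;
-- Or enable them for all new connections to the database
ALTER DATABASE tsdb SET timescaledb.enable_tiered_reads = true;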
Performance Trade-Offs
Understanding the trade-offs between high-performance storage and object storage like S3 is essential for managing performance in a tiered storage environment.
Query speed
PostgreSQL is optimized for real-time queries with sub-millisecond response times. Due to the nature of object storage, query latency increases for old data stored in S3. However, this trade-off is acceptable for many analytical workloads or infrequently accessed data. Timescale ensures that queries spanning both tiers are handled efficiently, but the speed of the high-performance tier cannot be matched by S3.
A particularly useful GUC provided by Timescale is timescaledb.enable_tiered_reads. When it is set to false, queries fetch data only from the high-performance tier and ignore data in the low-cost storage entirely, as if it didn't exist. If your use case can tolerate ignoring data in the low-cost (and lower-performance) tier, setting this GUC to false is one way of guaranteeing query speed.
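Here is a minimal sketch of scoping that setting, assuming the conditions hypertable from the earlier examples:
-- Ignore tiered data for this session; queries touch only the high-performance tier
SET timescaledb.enable_tiered_reads = false;
-- Or limit the setting to a single transaction
BEGIN;
SET LOCAL timescaledb.enable_tiered_reads = false;
SELECT count(*) FROM conditions WHERE time_column > now() - INTERVAL '7 days';
COMMIT;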
Storage costs
The cost difference between high-performance storage and low-cost S3 storage is significant. As a result, storing large datasets that may not be frequently accessed in the high-performance tier is costly. Tiering older data to S3 can reduce this burden, lowering your overall operational costs.
Immutability vs. flexibility
PostgreSQL allows real-time inserts, updates, and deletes, which is critical for most OLTP and real-time analytics applications. Timescale’s high-performance tier also allows these operations over compressed data. S3’s immutable nature, however, limits flexibility. Data tiered to the low-cost S3 is best for archival or append-only datasets where modifications are not expected.
How to Optimize Queries Across Both Tiers
When querying data that spans both tiers, TimescaleDB uses its query planner to seamlessly fetch data from both tiers. However, query performance can degrade if queries are not optimized. Here's how you can improve this:
Use LIMIT and ORDER BY for partial results
Queries that span both tiers can be optimized by retrieving partial results instead of scanning the entire dataset. Using LIMIT along with ORDER BY reduces overhead by fetching only the required number of rows.
Example
SELECT * FROM conditions
WHERE time_column BETWEEN '2024-01-01' AND '2024-06-01'
ORDER BY time_column DESC
LIMIT 100;
Explanation
- This query fetches the latest 100 rows from the specified time range. By ordering the results and limiting the row count, it retrieves only the most recent data (typically from high-performance storage) and touches the low-cost tier only when necessary.
Chunk exclusion techniques to prevent unnecessary scans
Chunk exclusion is one of Timescale’s most powerful features for improving query performance. It ensures that the database only scans the necessary chunks of data and avoids querying irrelevant chunks, particularly in the low-cost tier.
Timescale automatically partitions data into chunks based on time intervals. Each chunk stores a specific time range, so by filtering on time_column in a WHERE clause, you help Timescale exclude chunks that don’t match the query, avoiding unnecessary data scans.
Using a WHERE clause for chunk exclusion
Let's say we want to fetch temperature data only for January 2024. Without chunk exclusion, the database might scan all data (including irrelevant historical data) if it’s not optimized properly.
Optimized query using WHERE for time-based chunk exclusion
SELECT
time_column, temperature
FROM
conditions
WHERE
time_column BETWEEN '2024-01-01' AND '2024-01-31';
Explanation
- Timescale’s query planner knows the time range for each chunk of data and will only scan the chunk(s) that contain data for January 2024.
- Chunks outside this range, regardless of tier, will be automatically excluded, reducing query latency.
Combining time_bucket and WHERE clause
You can combine the time_bucket function with a specific time range to both aggregate data and enable chunk exclusion.
SELECT
time_bucket('1 hour', time_column) AS bucket,
avg(temperature) AS avg_temp
FROM
conditions
WHERE
time_column BETWEEN '2024-03-01' AND '2024-03-31'
GROUP BY
bucket;
Explanation
- This query aggregates data into one-hour intervals for the month of March 2024.
- Timescale’s chunk exclusion mechanism ensures that only the relevant chunks containing March 2024 data are scanned, avoiding any unnecessary chunks from both the high-performance and low-cost tiers.
Handling Immutability in Low-Cost Storage: Strategies and Workarounds
One of the key trade-offs of using S3 as low-cost storage in Timescale is immutability—once data is moved to low-cost storage, it cannot be modified directly. While this immutability ensures data stability and durability, it can become a challenge when you need to update historical data for backfilling or corrections.
This section will discuss strategies for handling immutability, including chunk rehydration, backfilling historical data, and using staging updates in the high-performance tier.
Backfilling historical data via chunk rehydration
When historical data needs to be updated (e.g., for correcting errors or adding missing data), Timescale provides a mechanism called chunk rehydration. This process allows you to temporarily bring old data back into the high-performance tier, where it can be modified, and then retier it by moving it back to low-cost storage.
Workflow for backfilling with chunk rehydration
- Identify the chunk: first, identify the chunk that contains the historical data you need to modify. You can run a query like the following to get the correct chunk name:
SELECT chunk_name FROM timescaledb_information.chunks WHERE hypertable_name = 'conditions';
- Rehydrate the chunk: use the untier_chunk function to bring the compressed old data from S3 back into high-performance storage.
- Modify the data: once the chunk is rehydrated, you can perform any necessary updates or backfill the missing data.
- Recompress and retier the chunk: after performing the updates, the data is compressed again and moved back to low-cost storage.
Example of rehydrating and backfilling a chunk
-- Bring the chunk back from low-cost storage to high-performance storage for modification
CALL untier_chunk('_timescaledb_internal._hyper_1_1_chunk');
-- Perform your data updates or inserts
UPDATE conditions
SET temperature = 22.5
WHERE time_column BETWEEN '2024-01-01' AND '2024-01-15';
-- After modifications, the chunk will be recompressed and sent back to low-cost storage
CALL retier_chunk('_timescaledb_internal._hyper_1_1_chunk');
Explanation
- The untier_chunk function temporarily moves the chunk back to the high-performance tier for updates.
- You can then modify the data as needed using regular SQL operations.
- Once the modifications are complete, the retier_chunk function recompresses the chunk and moves it back to low-cost storage.
When to use chunk rehydration
- Backfilling missing data: when historical data needs to be updated to fill in missing records (e.g., due to sensor downtime in IoT applications).
- Correcting historical errors: fixing errors in older data that was previously archived to low-cost storage.
- Compliance updates: modifying data to meet regulatory requirements, especially when historical records need corrections.
Staging updates in the high-performance tier
If you need to perform more frequent updates on old data or want to avoid repeated rehydration of chunks, another effective strategy is to stage updates in the high-performance tier before moving the data to low-cost storage. This approach is particularly useful when backfilling or modifying a large amount of historical data.
Workflow for staging updates
- Move data temporarily to the high-performance tier: when you anticipate updates, keep the data in the high-performance tier for a longer duration. This makes it easier to modify without moving it back and forth.
- Perform updates or backfilling: execute the updates or backfill operations while the data remains in high-performance storage.
- Recompress and tier the data: once all necessary updates have been made, recompress and move the data to the low-cost tier.
Example of staging data for updates
-- Extend the tiering interval so data stays in high-performance storage longer and remains modifiable
SELECT add_tiering_policy('conditions', INTERVAL '12 months');
-- Perform any updates or backfilling on recent data
UPDATE conditions
SET temperature = temperature + 1.0
WHERE time_column BETWEEN '2024-01-01' AND '2024-01-31';
-- Once updates are complete, shorten the tiering interval so the data moves back to low-cost storage
SELECT add_tiering_policy('conditions', INTERVAL '6 months');
Explanation
- By extending the tiering interval, data remains in the high-performance tier (and therefore stays modifiable) for a longer period.
- After performing the required updates, you can shorten the interval again so the data moves back to low-cost storage.
When to use staging in high-performance tier
- Frequent backfills: if you know that a specific dataset might need frequent updates, keeping it in the high-performance tier can avoid the overhead of constant rehydration and retiering.
- Time-dependent updates: some updates may need to happen at regular intervals (e.g., quarterly data adjustments). Staging the data in the high-performance tier before moving it to low-cost storage ensures all updates are completed before finalizing.
Workarounds for immutability conflicts
In cases where immutability conflicts with data maintenance needs, there are a few additional strategies you can apply to balance the benefits of low-cost storage with the ability to modify data.
Using high-performance-only tables for staging and merging
Instead of moving data back and forth between high-performance and low-cost tiers, you can create temporary high-performance-only tables to stage data updates before merging them with historical data.
Workflow for staging and merging:
- Create a temporary staging table: insert new or updated records into a separate table that resides entirely in the high-performance tier.
- Merge with historical data: once the staging table has all the updates, merge the new data into the main hypertable, rehydrating chunks if necessary.
Example
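A minimal sketch of the staging-and-merge pattern, assuming a hypothetical staging table conditions_staging that mirrors the conditions hypertable (adapt the column list and chunk name to your schema):
-- Stage corrections in a regular table that lives entirely in the high-performance tier
CREATE TABLE conditions_staging (LIKE conditions INCLUDING DEFAULTS);
INSERT INTO conditions_staging (time_column, temperature, humidity)
VALUES ('2024-01-15', 23.1, 44.8);
-- Rehydrate the affected chunk if it has already been tiered
CALL untier_chunk('_timescaledb_internal._hyper_1_1_chunk');
-- Merge the staged rows into the main hypertable
INSERT INTO conditions (time_column, temperature, humidity)
SELECT time_column, temperature, humidity FROM conditions_staging;
DROP TABLE conditions_staging;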
Benefits
- Separation of concerns: the staging table allows you to manage updates independently of the low-cost storage, ensuring immutability is maintained until necessary.
- Minimizes chunk rehydration: by collecting all updates in a high-performance-tier staging table first, you minimize the need for constant chunk rehydration.
Appending new data instead of updating
If modifications are minimal and mostly involve adding new information rather than altering existing records, consider designing your data model to favor append-only operations. This aligns well with the immutability of low-cost storage, as you won’t need to modify the existing data.
Example
-- Append new temperature readings as new rows rather than modifying existing ones
INSERT INTO conditions (time_column, temperature, humidity)
VALUES ('2024-01-15', 22.5, 45.2);
When to use
- Audit trails: in situations where keeping a historical record of all changes is important (e.g., audit trails), appending new data instead of updating is a natural fit.
- Avoiding data mutability: by not altering existing records, you ensure the data in low-cost storage remains immutable and untouched.
Best Practices for Using Tiered Storage
1. Set appropriate data retention policies
One of the most important aspects of leveraging tiered storage is setting well-thought-out data retention and tiering policies. Timescale allows you to move older data to low-cost storage after a specified period using simple SQL commands. For instance, you can automate the movement of all data older than six months to low-cost storage:
SELECT add_tiering_policy('your_hypertable', INTERVAL '6 months');
This policy automatically shifts the data after six months, ensuring the high-performance tier remains optimized for recent queries. Implementing intelligent policies helps prevent overburdening the high-performance tier and keeps performance high.
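If you later need to adjust the interval, the usual pattern (assuming a remove_tiering_policy counterpart to add_tiering_policy, mirroring Timescale's other add_/remove_ policy functions) is to drop the existing policy and add a new one:
-- Replace the existing tiering policy with a longer interval
SELECT remove_tiering_policy('your_hypertable');
SELECT add_tiering_policy('your_hypertable', INTERVAL '9 months');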
2. Monitor query patterns and optimize tier shifting
Not all older data may need to move immediately to low-cost storage. For workloads that intermittently access historical data for compliance or analysis, you may want to fine-tune tiering policies based on query patterns. Understanding which data is critical for real-time analysis helps you optimize storage costs while maintaining efficient performance.
For example, financial institutions often need to access last year’s data for audits or compliance, but the data doesn't need to reside in the high-performance tier at all times. Query monitoring tools within Timescale can give you insights into data access frequency, which helps you make informed decisions on when to move data to the low-cost tier.
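One way to ground these decisions is to look at the time range each chunk covers before picking a tiering interval, for example:
-- List chunks of the hypertable and the time ranges they cover
SELECT chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'your_hypertable'
ORDER BY range_start;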
3. Compression for high-performance tier data
Compression can substantially reduce the storage footprint of high-performance data. Timescale’s native hybrid row-columnar storage engine reduces disk space usage by 90% or more, which lowers storage costs even for the high-performance tier. Data in the high-performance tier can be kept compressed for longer while still ensuring fast query response times, allowing more efficient use of the more expensive PostgreSQL storage. This is especially useful for time-series data, which often exhibits high redundancy.
Example:
ALTER TABLE your_hypertable SET (timescaledb.compress = true);
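To take this further, you can tell the columnar engine how to organize compressed data and add a policy that compresses chunks automatically (the device_id segmentby column and the 7-day interval below are illustrative):
-- Segment and order compressed data for faster filtering
ALTER TABLE your_hypertable SET (
  timescaledb.compress = true,
  timescaledb.compress_segmentby = 'device_id',
  timescaledb.compress_orderby = 'time_column DESC'
);
-- Automatically compress chunks once they are older than 7 days
SELECT add_compression_policy('your_hypertable', INTERVAL '7 days');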
4. Plan for the S3 trade-offs
While Timescale enables transparent access to data in the low-cost tier, it’s important to consider S3's characteristics:
Immutability
Once data is tiered to low-cost storage, it cannot be modified. This is a fundamental trade-off of using object stores like S3. Plan accordingly and ensure that data you no longer expect to modify goes into low-cost storage.
Higher latency
Accessing old data from S3 involves additional latency compared to querying high-performance tier’s recent data. Timescale mitigates this with efficient query planning, but it's worth noting that certain queries, especially those that span both tiers, may see increased response times. Design your applications to tolerate slightly higher latency for historical queries.
5. Untier chunks for backfilling or updates
If you need to update historical data that resides in low-cost storage, Timescale provides a mechanism to untier specific data chunks, allowing you to bring them back to high-performance storage temporarily. This can be useful for cases where backfilling data is necessary.
Example:
CALL untier_chunk('_timescaledb_internal._hyper_1_1_chunk');
6. Optimize queries for tiered data
When querying across both recent and old data, ensure your queries are optimized to avoid unnecessary scans across the entire dataset. Timescale’s chunk exclusion feature automatically excludes unnecessary data chunks from being queried. Additionally, using time_bucket and other SQL functions effectively can further optimize performance.
7. Leverage data replication smartly
Timescale’s replication mechanisms ensure that data, whether in the high-performance tier or the low-cost tier, is stored securely across multiple nodes. If you’re using read replicas or forks for testing, ensure that you are using the low-cost tier efficiently across all instances. Because Timescale only charges once for data stored in the low-cost tier, replicated data remains cost-efficient.
Conclusion
Timescale’s tiered storage unlocks both the performance of PostgreSQL and the cost savings of cloud-based object storage like S3, making it an ideal solution for time-series data and challenging workloads at scale. By effectively managing tiering policies, leveraging compression, and understanding the trade-offs between high-performance and low-cost data storage, you can optimize your Timescale deployment to handle growing datasets with ease while keeping cloud costs in check.
Incorporating these best practices will help you strike the right balance between performance and cost, ensuring that your time-series workloads remain efficient and scalable for the long term. For further reading, make sure to read this article on Scaling PostgreSQL to Petabyte Scale using Timescale’s tiered storage architecture.
Tiered storage is available only in Timescale’s PostgreSQL cloud platform, Timescale Cloud. Start for free here (no credit card required, free for 30 days).