DEV Community

Cover image for PostgreSQL vs. Specialized Solutions: Evaluating Your Open-Source Database Options
Team TigerData for Tiger Data (Creators of TimescaleDB)

Posted on • Originally published at timescale.com

PostgreSQL vs. Specialized Solutions: Evaluating Your Open-Source Database Options

Time-series databases answer a challenge that more companies increasingly recognize: measuring how systems change.

To analyze time-series data effectively, you must collect data at speed and scale. If you ingest hundreds, thousands, or even more data points per second, they can pile up quickly.

Ultimately, you need a database that can handle the scale, has excellent usability, and is reliable. A general database is unlikely to meet all three criteria for actual time-series workloads. You need a purpose-built time-series database.

In this post, we’ll compare some of the most popular time-series databases and review key features and criteria to consider when choosing one.

What Makes a Time-Series Database the Best?

There are many ways a database can be the best, depending on your preferred optimization points. In this section, we outline the most critical aspects of a time-series database: scalability, maintainability, reliability, and query language.

  • Scalability. Time-series data typically grows rapidly, and the database must accommodate large volumes of data without performance degradation.

  • Maintainability. Maintaining a database involves the ease of performing backups, updates, and other routine tasks to ensure data integrity and availability. This is important as it affects the database's long-term usability and reliability, reducing the time and effort required for administrative tasks.

  • Reliability. Ensuring data accuracy and availability, especially during high demand or failures, is essential. Since time-series databases often support business decisions, they require consistent and dependable data access.

  • Query language. A database's query language determines how users interact with the data, affecting ease of use and the learning curve. A familiar query language like SQL can reduce training time and integrate more easily with existing tools, enhancing productivity.

Top Time-Series Databases in Comparison

Database Model Scalability Deployment Query Language Pricing Models
TimescaleDB Relational database Vertically scalable, with automatic partitioning, columnar compression, optimized queries, and automatically updated materialized views. Horizontally, you can scale your read load on Timescale Cloud by adding read replicas. Self-managed or managed cloud service SQL TimescaleDB open-source can be self-hosted and is free. Timescale Cloud follows a pay-as-you-go model
InfluxDB Custom, non-relational NoSQL, columnar database Horizontally scalable Self-managed or managed cloud service SQL, InfluxQL, Flux InfluxDB Open Source is free to use, and InfluxDB Cloud Serverless is a managed service with a pay-as-you-go pricing model. InfluxDB Cloud Dedicated and InfluxDB Clustered are for high-volume production workloads with costs depending on storage, CPU, and RAM
Prometheus Pull-based model that scrapes metrics from targets Scalable vertically or through federation Deployed as single binary on server or on container platforms such as Kubernetes PromQL Open-source: no associated license cost
Kdb+ Columnar database with a custom data model Horizontally scalable with multi-node support and multi-threading On-premises, in the cloud, or hybrid Q language Free 32-bit version for non-commercial purposes. For commercial deployments, pricing depends on the deployment model & number of cores
Graphite Whisper (file-based time series) database format Horizontally scalable, supports replication and clustering On-premises or in the cloud GQL Open-source: no associated licensing cost
ClickHouse Columnar database Horizontally scalable On-premises or in the cloud. Also available as a managed service SQL-based declarative query language, mostly identical to ANSI SQL standard ClickHouse is open-source and doesn't have an associated license cost. ClickHouse Cloud follows a pay-as-you-go pricing model
MongoDB No-SQL database with a JSON-like document model Horizontally scalable - supports automatic load balancing, data sharding, and replication Self-managed or managed cloud service MQL (MongoDB Query Language) MongoDB Community Edition is open-source and free to use. MongoDB Enterprise: pricing depends on the features you choose. MongoDB Atlas has a pay-as-you-go pricing model

InfluxDB

InfluxDB is a popular time-series database known for its high performance and scalability. Built from the ground up for time-series data, it offers a schema-less data model, which makes it easy to add new fields without modifying the schema or migrating data. Depending on which version you're using, it uses specialized query languages called InfluxQL and Flux (you can also rely on standard SQL).

InfluxDB can handle high-throughput ingestion of metrics data and is widely used in monitoring and alerting, IoT data storage and analysis, and real-time analytics. It’s also suited for applications that involve analyzing and tracking data points over time and for scenarios where users need to query data quickly after ingestion for real-time analytics.

The database uses columnar storage for data, Apache Arrow for in-memory data representation, and Parquet for file storage, which allows for faster queries and better data compressions. Plus, it allows you to define retention policies that determine how long your data is stored before it is automatically deleted.

What sets InfluxDB apart is its architecture which involves four key components (data ingest, data compaction, data querying, and garbage collection), each operating independently. With the query and ingest components decoupled, it is easy to independently scale each component depending on the workload demands. InfluxDB's clustering capabilities also enable horizontal scalability, allowing it to scale with growing data volumes and workload demands.

Prometheus

Prometheus is a monitoring and alerting toolkit known for its reliability and scalability. It is optimized for collecting and querying time-series data related to system and application metrics. Prometheus uses a pull-based model where the server collects metrics from the target at regular intervals and stores data in a custom time-series database that’s optimized for low resource usage and fast querying. This model makes it highly suitable for monitoring dynamic environments since it enables automatic discovery and allows monitoring of new instances.

Prometheus has an extensible and modular architecture with components such as service discovery mechanisms and exporters. However, since it’s a non-distributed system, it isn’t horizontally scalable and lacks built-in clustering, although it supports federation, which allows multiple servers to share data.

Its query language, PromQL, allows users to perform queries and aggregations on metric data. Prometheus is widely used in cloud-native environments and is known for its simplicity, flexibility, and robustness. It integrates with container orchestration platforms like Kubernetes, making it a popular choice for monitoring microservices architectures, including Docker and Kubernetes. Other use cases included application performance monitoring and anomaly detection.

However, note that Prometheus is not a general-purpose time-series database, so it might not be suitable for long-term data storage.

Kdb+

kdb+ is a columnar database optimized for handling financial time-series data such as stock market trades, although it works for other kinds of time-series data, including IoT and sensor data. Developed by Kx Systems, kdb+ is known for its speed, scalability, and advanced analytics capabilities, making it suitable for applications that need high-speed querying and analysis of high-volume data. However, it uses a proprietary language called q for querying and manipulating data, which is tailored for working with time-series data in finance.

The database is suited to scenarios where low-latency access to large volumes of time-series data is critical, such as algorithmic trading, risk management, and tick data analysis. Its architecture allows for on-disk and in-memory storage, and it can scale horizontally across numerous machines. Its data storage and processing make it a niche choice for financial institutions and trading firms seeking to gain insights from real-time market data.

Timescale

TimescaleDB is a powerful open-source time-series database built on top of PostgreSQL. It offers scalability, reliability, and advanced time-series capabilities. It automatically partitions time-series data using chunks and hypertables, which improves query performance and simplifies data management.

Timescale's compression offers significant storage savings by reducing the footprint of time-series data, often achieving up to 90 percent compression rates. This not only lowers storage costs but also enhances query performance, as less data needs to be read from disk, leading to faster analytics and more efficient resource utilization.

With TimescaleDB, you can leverage the familiarity and power of SQL to query and manipulate time-series data, enabling complex analytics and insights. It also provides specialized SQL operators and functions for different scenarios, like first, last, and time_bucket, that simplify aggregation and querying.

Timescale's continuous aggregates optimize query performance by precomputing and materializing the results of time-series queries at regular intervals. This allows for instantaneous access to up-to-date summaries and insights, reducing the computational overhead on your database and ensuring faster, more efficient data analysis.

Its integration with PostgreSQL also ensures data durability, ACID compliance, and compatibility with a wide range of tools and applications. Whether it's for monitoring, IoT, financial analysis, or other time-series use cases, TimescaleDB provides a robust solution for storing and analyzing time-series data at scale.

All of these features make Timescale ideal for storing and analyzing metrics such as application logs, sensor data, server performance metrics, and financial data like trading volumes, exchange rates, and stock prices.

Graphite

Graphite is an open-source monitoring tool and time-series database used for collecting, storing, and visualizing metric data. It uses the Whisper format, which efficiently manages and stores data by aggregating and expiring data based on retention policies, and a simple yet powerful query language, making it easy to ingest and analyze time-series data. It also supports a range of functions for querying, aggregating, and transforming time series data, allowing users to create custom dashboards and graphs.

The database’s architecture consists of numerous components, including Carbon (which receives metrics from various sources, caches them in memory, and stores them in Whisper), Whisper (which manages data aggregation and retention), and Graphite-web (a web app that provides a UI for visualizing and querying time series data).

Graphite is commonly used for monitoring infrastructure and application metrics in environments ranging from small-scale deployments to large-scale systems. It provides customizable dashboards and graphs for visualizing metric trends and anomalies, helping users gain insights into the performance and health of their systems. While Graphite has limitations in scalability compared to some other solutions, its simplicity and versatility make it a popular choice for many monitoring use cases.

ClickHouse

ClickHouse is an open-source columnar database management system optimized for analytics workloads. It excels in performing fast analytical queries on large volumes of data, and stores data in a column-oriented format, which enables efficient compression and processing for analytical queries (since only the necessary columns are read as the query is executed).

The database's native support for SQL allows you to leverage your existing SQL skills and tools for querying and analyzing data. It is also highly scalable and can handle petabytes of data across distributed clusters. Plus, it uses MergeTtree as its primary table engine, which supports indexing, replication, and partitioning. It also supports materialized views, which improves query performance.

ClickHouse is commonly used for ad hoc analytics, real-time reporting, and data warehousing applications where fast query performance is essential. Although it’s not optimized for working with time-series data, it can still effectively store and analyze it. And while it can quickly query time-series data once ingested, it can struggle with high-write scenarios where the data should be ingested in small batches.

MongoDB

MongoDB is a NoSQL database that is commonly known for its ease of use, scalability, and flexibility. It stores data in JSON-like documents, making it suitable for various use cases, including real-time analytics, content management, and mobile applications. Despite being a general-purpose NoSQL database, MongoDB can efficiently process and store time-series data. Its flexible data model easily adapts to the evolving structure of time-series data.

MongoDB's horizontal scalability allows it to handle large volumes of data and high-throughput workloads with ease. And its query language, MongoDB Query Language (MQL), supports a rich set of operations for querying and manipulating document data.

Time-Series Database Features to Consider

Time-series databases may provide general features as well as time-series features (e.g., TimescaleDB) or focus on providing time-series features at the cost of supporting more general workloads (e.g., InfluxDB). Time-series data doesn’t live in a silo (you often want to correlate it with other types of data), so whatever the approach, you will need a way to relate general and time-series data.

In the following section, you will see examples of time-series features you can expect from a time-series database. To show you both ends of the spectrum, you’ll see what the features look like in two totally different time-series databases: InfluxDB (with its own “Flux” query language) and TimescaleDB (which extends PostgreSQL and offers full-SQL support).

Time-based aggregation

Time-based aggregation is a must-have feature for time-series databases. Without it, you won’t be able to create hourly, daily, weekly, and monthly time aggregations (and many permutations in between) that are key for analyzing time-series data. Let’s see what time-based aggregate queries look like in the case of InfluxDB and TimescaleDB:

InfluxDB

from(bucket: "crypto")
  |> range(start: 0)
  |> aggregateWindow(every: 1d, fn: count, createEmpty: false)
  |> sort(columns: ["_time", ], desc:true)
  |> limit(n:10)
Enter fullscreen mode Exit fullscreen mode

Using InfluxDB’s query language, Flux, you must define a time range for the query. You can use 0 as the start parameter to work around this—in case you don’t want to define a specific time range. Then you can use the aggregateWindow()function to create arbitrary time “buckets.”

TimescaleDB

SELECT
  time_bucket('1 day', time) AS bucket,
  count(*) row_count
FROM crypto_ticks
GROUP BY bucket
ORDER BY bucket DESC
LIMIT 10

bucket          |row_count|
-------------------+---------+
2022-09-09 02:00:00|    52277|
2022-09-08 02:00:00|   128657|
2022-09-07 02:00:00|   134849|
2022-09-06 02:00:00|   132837|
2022-09-05 02:00:00|   126254|
Enter fullscreen mode Exit fullscreen mode

In TimescaleDB you can use the time_bucket() function to create arbitrary time buckets. Besides, all the other available PostgreSQL functions, like count(*) works the same way as in regular PostgreSQL.

When working with time-series data, you will have to create many time-based aggregations, so make sure that the database you choose provides a simple and intuitive interface for creating time buckets.

Automatic downsampling

Time-series data is often ingested at a very high resolution (e.g., thousands of data points per second). To make it easier to analyze time series, users often downsample their data (e.g., they convert thousands of data points per second to only one). This technique not only saves storage costs because lower-resolution data needs to be stored, but it also makes it easier to create visualizations and recognize trends in the data.

Downsampling is often done repeatedly and continuously, which means that if, for example, you insert multiple new rows every second, the database rolls up the incoming data into larger buckets automatically. Instead of aggregating the raw data yourself, the database takes care of it automatically and in real time. Let’s see how InfluxDB and TimescaleDB handle downsampling with an OHLC example.

InfluxDB

close=from(bucket: "crypto")
  |> range(start: -30d)
  |> group(columns:["symbol"])
  |> filter(fn: (r) => r["_measurement"] == "ohlc")
  |> window(every: 1h)
  |> reduce(fn: (r, accumulator) => ({
Enter fullscreen mode Exit fullscreen mode
      indexLow:
        if (r._field=="low") then 
          accumulator.indexLow+1 
        else
        accumulator.indexLow,
Enter fullscreen mode Exit fullscreen mode
      indexOpen: 
      if (r._field=="open") then 
       accumulator.indexOpen+1 
       else 
       accumulator.indexOpen,
        open: 
      if (r._field=="open") then 
        if (accumulator.indexOpen==0) then 
          r._value 
        else 
          accumulator.open
      else
        accumulator.open  
    , 

Enter fullscreen mode Exit fullscreen mode
      high:
       if (r._field=="high") then  
          if(r._value>accumulator.high ) then
            r._value
          else
            accumulator.high 
      else 
        accumulator.high
   ,
Enter fullscreen mode Exit fullscreen mode
    low: 
      if (r._field=="low") then
          if(r._value<accumulator.low or accumulator.indexLow==0.0) then
            r._value
          else
           accumulator.low 
      else 
        accumulator.low,
Enter fullscreen mode Exit fullscreen mode
             close: 
       if (r._field=="close") then 
          r._value 
      else 
        accumulator.close,
Enter fullscreen mode Exit fullscreen mode
             volume: 
        if (r._field=="volume") then
          r._value+accumulator.volume 
          else
           accumulator.volume        
    }),
    identity: {indexLow:0,indexOpen:0,open: 0.0,high: 0.0,low: 0.0,close: 0.0,volume: 0.0})
    |> drop(columns: ["indexOpen","indexLow"])
  |> group(columns:["pair"])
    |> yield(name: "candle")
Enter fullscreen mode Exit fullscreen mode

InfluxDB’s Flux provides a convenient way to write simple queries, but if you want to create somewhat more complex queries, like creating OHLC aggregates from raw financial tick data, the final query can become quite long as you can see.

TimescaleDB

CREATE MATERIALIZED VIEW hourly_buckets
WITH (timescaledb.continuous)
AS
SELECT
  time_bucket('1 hour', time) AS bucket,
  symbol,
  first(price, time) AS open,
  max(price) AS high,
  min(price) AS low,
  last(price, time) AS close
FROM crypto_ticks
GROUP BY bucket, symbol;
Enter fullscreen mode Exit fullscreen mode
SELECT * FROM hourly_buckets;
bucket          |symbol  |open   |high   |low   |close  |
-------------------+--------+-------+-------+-------+-------+
2022-02-08 22:00:00|ADA/USD |  1.166|   1.17|  1.157|  1.168|
2022-02-08 22:00:00|ATOM/USD|  30.44|  30.63|   30.3|  30.51|
2022-02-08 22:00:00|AVAX/USD|  87.85|   88.0|  86.72|  87.06|
2022-02-08 22:00:00|BNB/USD |  413.5|  416.5|  410.3|  410.3|
2022-02-08 22:00:00|BTC/USD |44192.4|44354.0|43938.6|44185.2|
Enter fullscreen mode Exit fullscreen mode

If you are familiar with PostgreSQL syntax, you can see that the TimescaleDB method is very similar to a PostgreSQL materialized view. However, the mechanism under the hood is different to provide a better developer experience for time-series data by automatically storing pre-aggregated buckets over time, maintaining aggregations when raw data changes, and even returning real-time data.

Querying recent data

You might want to build visual dashboards to display time-series trends or even close to real-time data. For creating trend charts, you can use the previously mentioned downsampling method. But for real-time data, you probably want to see more granular and recent data, e.g., all data points from the past five minutes. Let’s see how you can make this simple request in InfluxDB and TimescaleDB.

InfluxDB

from(bucket: "crypto")
  |> range(start: -5m)
  |> filter(fn: (r) => r.symbol == "BTC/USD")
  |> sort(columns: ["_time", ], desc:true)
  |> limit(n:5)
  |> keep(columns: ["_time", "_value"])
Enter fullscreen mode Exit fullscreen mode

In Flux, you can specify a time range that is relative to now with start: -5m which will return all data for the “BTC/USD” symbol from the past five minutes.

TimescaleDB

SELECT
  time,
  price
FROM crypto_ticks
WHERE
  "time" > NOW() - INTERVAL '5 minutes' AND
  symbol = 'BTC/USD'
ORDER BY time DESC
LIMIT 5;
time            |price  |
-------------------+-------+
2022-09-12 15:24:07|22346.7|
2022-09-12 15:24:03|22346.3|
2022-09-12 15:23:50|22346.7|
2022-09-12 15:23:45|22355.9|
2022-09-12 15:23:40|22358.1|

Enter fullscreen mode Exit fullscreen mode

In the TimescaleDB example, you can see a familiar SQL example (if you already know SQL) with a symbol filter and a relative time filter in the WHERE clause using the NOW() PostgreSQL function. Under the hood, however, this query gets executed differently from regular PostgreSQL: when you insert time-series data into the database, TimescaleDB auto-partitions your table based on the time column.

Then, when you make a query containing a time filter, like in this example, TimescaleDB can exclude whole chunks from scanning, which makes querying recent data lightning fast, even if you have billions of rows stored in the database.

Long-range analytical queries

What if you are also interested in analyzing longer time frames, e.g., all data from the past year? Maybe you want to see the highest price of a certain stock or crypto symbol in the past year.

InfluxDB

from(bucket: "crypto")
  |> range(start: -1y)
  |> group(columns: ["code"])
  |> max()
  |> group()
  |> sort(columns: ["_value"], desc: true)
Enter fullscreen mode Exit fullscreen mode

This example shows that Flux executes your query in the same order as you describe it.

TimescaleDB

SELECT
  symbol,
  MAX(price) AS max_price
FROM crypto_ticks
WHERE
  "time" >= NOW() - INTERVAL '1 year'
GROUP BY symbol
ORDER BY max_price DESC;
Enter fullscreen mode Exit fullscreen mode
symbol   |max_price |
---------+----------+
BTC/USD  |   48210.1|
WBTC/USD |  48169.56|
ETH/USD  |   3579.38|
BNB/USD  |  460.0|
SOL/USD  |  143.55|
Enter fullscreen mode Exit fullscreen mode

Analytical queries like this, with a larger time window as the filter, are not typical time-series queries, but you might want to run these from time to time. TimescaleDB provides two features that significantly speed up these queries: native compression, which saves space and converts your data into a columnar form, and continuous aggregates, which automatically maintain materialized aggregate data that can be retained separately from raw readings. Together, these features can have a dramatic effect on the performance of your application.

JOINing time-series data with other business data

Sometimes we only talk about time-series data without mentioning all the other data that real-world projects have in their data infrastructure. But the reality is that time-series data is always connected to non-time-series (business) data. If you plan to analyze your time-series data and business data together, the database you choose needs to be able to JOIN them and work with them quickly and simply. In the following examples, you can see how to join two tables in InfluxDB and TimescaleDB.

InfluxDB

crypto_assets = from(bucket: "crypto-assets")
    |> range(start: -1mo)
    |> filter(fn: (r) => r._measurement == "assets" and r._field == "symbol")
Enter fullscreen mode Exit fullscreen mode
crypto_ticks = from(bucket: "crypto-ticks")
    |> range(start: -1m)
    |> filter(fn: (r) => r._measurement == "ticks" and r._field == "price")
Enter fullscreen mode Exit fullscreen mode
join(
    tables: {assets:crypto_assets, ticks:crypto_ticks},
    on: [symbol, ],
)
Enter fullscreen mode Exit fullscreen mode

The big difference between InfluxDB and TimescaleDB in this regard is that InfluxDB can only store timestamped data, while TimescaleDB can store timestamped and non-timestamped data right next to each other. Thus, in InfluxDB you can only join time-series data with other time-series data but not relational data.

TimescaleDB

SELECT  crypto_assets.name,  bucket,  close,  high,  low,  open
FROM one_day_candle
INNER JOIN crypto_assets ON crypto_assets.symbol = one_day_candle.symbol
WHERE
  bucket > NOW() - INTERVAL '1 month' AND
  one_day_candle.symbol = 'BTC/USD'
ORDER BY bucket;
name    |bucket             |close  |high   |low    |open   |
-----------+-------------------+-------+-------+-------+-------+
Bitcoin USD|2022-08-13 02:00:00|24460.6|24889.5|24312.3|24402.2|
Bitcoin USD|2022-08-14 02:00:00|24312.4|25034.2|24160.4|24455.2|
Bitcoin USD|2022-08-15 02:00:00|24092.8|25210.9|23798.7|24316.2|
Bitcoin USD|2022-08-16 02:00:00|23867.7|24247.5|23692.0|24103.2|
Bitcoin USD|2022-08-17 02:00:00|23340.1|24430.1|23184.4|23857.3|

Enter fullscreen mode Exit fullscreen mode

In TimescaleDB, you can use PostgreSQL’s JOIN to connect any two tables in the same database, enabling you to store your non-time-series data next to your time-series data. Without this feature, you might have a harder time bringing your data together from multiple sources.

Fun fact: One of the reasons TimescaleDB was created was that the founders struggled to find a database that could do easy JOINs for time-series data.

Data retention and compression

While observing how thousands of companies handle time-series data, we found that it becomes less valuable over time. This means users often want to archive or even remove older data after a certain time to save on storage costs.

InfluxDB

influx example
In InfluxDB you can change the data retention settings on a per bucket basis on the UI. In older versions, you could also add data retention policies this way:

CREATE RETENTION POLICY "one_year" ON "some_dataset" DURATION 1y REPLICATION 1 DEFAULT
Enter fullscreen mode Exit fullscreen mode

TimescaleDB

Compression:
ALTER TABLE example SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = symbol
);
SELECT add_compression_policy(crypto_ticks, INTERVAL '2 weeks');
Enter fullscreen mode Exit fullscreen mode
Data retention:
SELECT add_retention_policy(crypto_ticks,, INTERVAL '1 year');
Enter fullscreen mode Exit fullscreen mode

With TimescaleDB, you can set up both a compression policy (to save on storage needs but keep the data available for querying) and a data retention policy (which gets rid of the data after the defined time period). Without essential tooling around data compression and data retention in the database, you’d need to implement and maintain these automations manually.

Next Steps

The value that time-series data provides is unquestionable. It’s like watching the whole movie instead of just the last frame. You do need a proper database that can play that movie for you though. I hope this article gave you the tools you need to choose the time-series database that best fits your team and your project.

Check out our docs for hands-on time-series tutorials:

If you are interested in Timescale, you can try Timescale for free or check out our GitHub, community Slackand join discussions with thousands of Timescale users.

Find this database comparison helpful?

Let us know what else you'd like to see in the comments!

Top comments (2)

Collapse
 
donniedev0 profile image
Donnie Clark

How do Timescale and ClickHouse actually stack up against each other in terms of performance?

Like, has anyone run any benchmarks to see a comparison for things like insert speeds and query performance?

Collapse
 
tigerdata_dev profile image
Team TigerData Tiger Data (Creators of TimescaleDB)

Hey Donnie!

Great question. We wanted to know the same — so our engineers invested hundreds of hours conducting an in-depth benchmark for an unbiased take.

ClickHouse vs TimescaleDB: A Comparison of Performance for Time-Series Data

Here's the quick rundown:

  • Inserts: ClickHouse outperforms with large batch sizes, but TimescaleDB is faster with smaller batches.
  • Queries: TimescaleDB (with compression enabled) outperforms ClickHouse on most queries in our benchmark suite, except for complex aggregates.
  • Disk usage: TimescaleDB uses less disk space, especially with smaller batch sizes.

The full details are in the blog post, which includes methodology and ways to replicate the tests yourself. Real-world performance can vary based on your specific use case, so it's always best to test with your own data and workloads.