<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kelvinsteve</title>
    <description>The latest articles on DEV Community by kelvinsteve (@kelvinsteve).</description>
    <link>https://dev.to/kelvinsteve</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F877592%2Fd24949dd-ff65-4ac6-a770-feeaa0c13785.jpg</url>
      <title>DEV Community: kelvinsteve</title>
      <link>https://dev.to/kelvinsteve</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kelvinsteve"/>
    <language>en</language>
    <item>
      <title>Postgres Partitioning - Pg_partman vs. Hypertables</title>
      <dc:creator>kelvinsteve</dc:creator>
      <pubDate>Wed, 13 Sep 2023 15:32:17 +0000</pubDate>
      <link>https://dev.to/tigerdata/postgres-partitioning-pgpartman-vs-hypertables-fjd</link>
      <guid>https://dev.to/tigerdata/postgres-partitioning-pgpartman-vs-hypertables-fjd</guid>
      <description>&lt;p&gt;In the world of #PostgreSQL, partitioning techniques are game-changers for optimizing dataset performance and maintenance. 📊 Two, in particular, dominate the arena: Timescale's hypertables (tailored for time-based/range partitioning) and the versatile pg_partman extension.&lt;/p&gt;

&lt;p&gt;Are you Team Hypertables or Team pg_partman? &lt;/p&gt;

&lt;p&gt;Dive into this article to explore the strengths of each method, and tell us your preference!&lt;/p&gt;

&lt;p&gt;[&lt;a href="https://tsdb.co/partvshyp" rel="noopener noreferrer"&gt;https://tsdb.co/partvshyp&lt;/a&gt;]&lt;/p&gt;
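
&lt;p&gt;As a rough illustration of the difference (the table and column names here are hypothetical), pg_partman maintains partitions of a native partitioned table, while Timescale converts a plain table into a hypertable. A sketch using the pg_partman 4.x API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- pg_partman: maintain daily partitions of a natively partitioned table
SELECT partman.create_parent('public.events', 'created_at', 'native', 'daily');

-- Timescale: turn a regular table into a time-partitioned hypertable
SELECT create_hypertable('events', 'created_at');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;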

</description>
      <category>postgres</category>
      <category>discuss</category>
      <category>database</category>
    </item>
    <item>
      <title>Building a Kafka Data Pipeline for Time Series With Kafka Connect and Timescale</title>
      <dc:creator>kelvinsteve</dc:creator>
      <pubDate>Wed, 30 Aug 2023 16:59:46 +0000</pubDate>
      <link>https://dev.to/tigerdata/building-a-kafka-data-pipeline-for-time-series-with-kafka-connect-and-timescale-1do9</link>
      <guid>https://dev.to/tigerdata/building-a-kafka-data-pipeline-for-time-series-with-kafka-connect-and-timescale-1do9</guid>
      <description>&lt;h2&gt;
  
  
  What Is Kafka Connect
&lt;/h2&gt;

&lt;p&gt;Apache Kafka Connect is an ecosystem of pre-written and maintained Kafka Producers (source connectors) and Kafka Consumers (sink connectors) for a wide range of data products and platforms, such as databases and message brokers. It lets you build data pipelines without having to write and test your own consumers and producers.&lt;/p&gt;

&lt;p&gt;There are two distinct types of connectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Source connectors:&lt;br&gt;
As the name suggests, these connectors act as a source of data and publish messages onto Kafka topics. For example, you can use the PostgreSQL JDBC source connector to publish a message onto a topic every time a row gets added to a table. This would allow you to set off a chain of events when (for example) someone posts a new message or a new user is created.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sink connectors:&lt;br&gt;
These connectors consume data from a Kafka topic and upload or insert that data onto a different data platform. For example, when someone makes a trade, you might want that event inserted into a time series database (like Timescale) for record-keeping and analytical purposes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the real benefit comes from the fact that you are not limited to a one-to-one relationship between producers and consumers. You can have multiple connectors act on the same message. Maybe your message is flowing between microservices, but you also want to store it in S3 and a data lake, and forward it to another message broker. The sky's the limit when it comes to building pipelines with Apache Kafka Connect.&lt;/p&gt;
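
&lt;p&gt;To make this concrete, connectors are usually registered by sending a JSON configuration to the Kafka Connect REST API. The sketch below shows a configuration for the Confluent JDBC sink connector; the connector name, topic, and connection details are placeholders, and the exact settings depend on the connector plugin you install:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "timescale-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "trades",
    "connection.url": "jdbc:postgresql://HOST:5432/tsdb",
    "connection.user": "postgres",
    "connection.password": "CHANGEME",
    "insert.mode": "insert",
    "auto.create": "true"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;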

&lt;p&gt;When I first started learning about Kafka and Kafka Connect, my biggest grievance was that there were almost no beginner-friendly end-to-end tutorials on properly setting up Kafka Connect for connectors that were more complicated than a local file sink. Because I had very limited Java experience, the ecosystem was quite daunting to wrap my head around, which made understanding and installing plugins all the more difficult.&lt;/p&gt;

&lt;p&gt;My goal for this tutorial is to clearly explain every step to set up a JDBC Sink connector that streams data from a Kafka topic into a Timescale database without any guesswork. If you aren’t fond of blog posts and would rather just dive into the code, &lt;a href="https://github.com/mathisve/kafka-shenanigans/blob/main/baremetal-configs/commands.sh?ref=timescale.com" rel="noopener noreferrer"&gt;you can find the full shell script with all the necessary steps and commands here.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka Connect Tutorial
&lt;/h2&gt;

&lt;p&gt;This short tutorial will show you how to set up Kafka and Kafka Connect to stream data from a Kafka topic into a Timescale database.&lt;/p&gt;

&lt;p&gt;Read the full article here: &lt;a href="https://www.timescale.com/blog/building-a-kafka-data-pipeline-for-time-series-with-kafka-connect-and-timescale/" rel="noopener noreferrer"&gt;https://www.timescale.com/blog/building-a-kafka-data-pipeline-for-time-series-with-kafka-connect-and-timescale/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Take the State of PostgreSQL 2023 Survey</title>
      <dc:creator>kelvinsteve</dc:creator>
      <pubDate>Thu, 24 Aug 2023 13:19:29 +0000</pubDate>
      <link>https://dev.to/tigerdata/take-the-state-of-postgresql-2023-survey-hl5</link>
      <guid>https://dev.to/tigerdata/take-the-state-of-postgresql-2023-survey-hl5</guid>
      <description>&lt;p&gt;The State of PostgreSQL 2023 survey is still open&lt;/p&gt;

&lt;p&gt;💻 Quick questions about how and why you use PostgreSQL, how you got started &amp;amp; more &lt;/p&gt;

&lt;p&gt;🗂 Fully anonymized source data will be shared publicly &lt;/p&gt;

&lt;p&gt;Help create a knowledge base for the entire community &lt;/p&gt;

&lt;p&gt;&lt;a href="https://timescale.typeform.com/state-of-pg-23/?utm_source=kelvins&amp;amp;utm_campaign=state-of-pg-2023&amp;amp;utm_content=state-of-pg-survey" rel="noopener noreferrer"&gt;Start here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>discuss</category>
      <category>sotwsurvey</category>
    </item>
    <item>
      <title>Expanding the Boundaries of PostgreSQL: Announcing a Bottomless, Consumption-Based Object Storage Layer Built on Amazon S3</title>
      <dc:creator>kelvinsteve</dc:creator>
      <pubDate>Wed, 16 Nov 2022 16:20:25 +0000</pubDate>
      <link>https://dev.to/tigerdata/expanding-the-boundaries-of-postgresql-announcing-a-bottomless-consumption-based-object-storage-layer-built-on-amazon-s3-2l11</link>
      <guid>https://dev.to/tigerdata/expanding-the-boundaries-of-postgresql-announcing-a-bottomless-consumption-based-object-storage-layer-built-on-amazon-s3-2l11</guid>
      <description>&lt;p&gt;We are excited to announce the initial launch, in private beta, of our new consumption-based, low-cost object storage layer in Timescale Cloud. This new capability expands the boundaries of traditional databases, allowing you to transparently tier your data across disk and Amazon S3 while accessing it as if it all lived in one single continuous PostgreSQL table. This means that you can now store an infinite amount of data in Timescale Cloud, paying only for what you store. &lt;/p&gt;

&lt;p&gt;Our new feature expands the boundaries of traditional managed databases with a novel cloud-native architecture designed to save developers money and time, incorporating elements typically associated with data warehouses and data lakes, enabling PostgreSQL developers to grow their data effortlessly for a fraction of the price of traditional storage.&lt;/p&gt;

&lt;p&gt;Rather than operating an external system for data archiving alongside a time-series database which often creates data silos and operational overhead, developers can now simply work with a single table where data is transparently tiered across different storage systems while retaining the ability to query that data via standard SQL.&lt;/p&gt;

&lt;p&gt;We built new database internal capabilities and external subsystems to give application developers access to an object store from within a PostgreSQL database in Timescale Cloud. The result is that time-series tables, called hypertables, can now stretch across standard disk storage (in Amazon EBS) and object storage (in Amazon S3), with data formats that are optimized for each layer. Query optimizations ensure that data is fetched from both disk and object storage for a single SQL query, abstracting away complexity from the PostgreSQL developer.&lt;/p&gt;
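
&lt;p&gt;From the developer’s point of view, everything is driven by ordinary SQL. The sketch below is illustrative only (the tiering policy function is an assumption about the beta API, and the table is hypothetical); the key point is that the query itself needs no knowledge of where each chunk physically lives:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Hypothetical: move chunks older than three weeks to object storage
SELECT add_tiering_policy('metrics', INTERVAL '3 weeks');

-- Query as usual: rows are fetched from both EBS and S3 transparently
SELECT time_bucket('1 day', time) AS day, avg(cpu)
FROM metrics
WHERE time BETWEEN now() - INTERVAL '90 days' AND now()
GROUP BY day
ORDER BY day;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;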

&lt;p&gt;In addition to cost-efficient data tiering, future enhancements to our newly released object store will seek to enable easier data sharing between fleets of databases, faster transfer of data into new deployments, and smoother migration of data from external data warehouses to Timescale Cloud.&lt;/p&gt;

&lt;p&gt;This new consumption-based, low-cost object storage is currently available under private beta. Users with a Timescale Cloud account can &lt;a href="https://c212.net/c/link/?t=0&amp;amp;l=en&amp;amp;o=3711885-1&amp;amp;h=4060893371&amp;amp;u=https%3A%2F%2Ftsdb.co%2Fdata-tiering-docs&amp;amp;a=request+acces" rel="noopener noreferrer"&gt;request access&lt;/a&gt;— &lt;a href="https://c212.net/c/link/?t=0&amp;amp;l=en&amp;amp;o=3711885-1&amp;amp;h=919175570&amp;amp;u=https%3A%2F%2Ftsdb.co%2Fcloud-console-signup&amp;amp;a=creating+an+account+is+free" rel="noopener noreferrer"&gt;creating an account&lt;/a&gt; is free and it gives developers full access to the platform for 30 days. &lt;/p&gt;

&lt;p&gt;Read the &lt;a href="https://www.timescale.com/blog/expanding-the-boundaries-of-postgresql-announcing-a-bottomless-consumption-based-object-storage-layer-built-on-amazon-s3/" rel="noopener noreferrer"&gt;full announcement here&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How Timescale is Raising the Bar on Hosted Database Support</title>
      <dc:creator>kelvinsteve</dc:creator>
      <pubDate>Thu, 15 Sep 2022 15:51:18 +0000</pubDate>
      <link>https://dev.to/tigerdata/how-timescale-is-raising-the-bar-on-hosted-database-support-15ch</link>
      <guid>https://dev.to/tigerdata/how-timescale-is-raising-the-bar-on-hosted-database-support-15ch</guid>
      <description>&lt;p&gt;If choosing a database in a crowded market is challenging, choosing where you will run said database can prove equally tricky. The options are endless, but it’s pretty common these days to run your database on a hosted service, whether that’s something like Amazon RDS or a database company’s own cloud, &lt;a href="https://www.timescale.com/cloud" rel="noopener noreferrer"&gt;like the one we have at Timescale&lt;/a&gt;. Still, one of the most critical aspects to consider when selecting a hosted provider is cloud support. What will your cloud provider do to help ensure your hosted database is successful?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Value of Deep, Consultative Support
&lt;/h2&gt;

&lt;p&gt;From the user’s perspective, you should expect a minimum set of services from any hosting provider, from backups to speedy and steady help when the database is down or other platform-related issues arise. But what if your queries are slow? What if a feature does not work as expected? What if you’d like guidance on how to architect for future growth? You may not have to consider the physical infrastructure when running a cloud database, but these are still real and valid concerns.&lt;/p&gt;

&lt;p&gt;Here at Timescale, we want you to be successful with your time-series and relational data. One of the ways we do that is through our &lt;a href="https://slack.timescale.com/" rel="noopener noreferrer"&gt;Slack Community&lt;/a&gt;, where many of us here at Timescale work together with many folks in our community on questions and issues. Many folks from our community work with each other as well. It’s a fantastic, vibrant community and is open to everyone. We also have a very active &lt;a href="https://www.timescale.com/forum/" rel="noopener noreferrer"&gt;Timescale Community Forum&lt;/a&gt;. But that’s not all we do—we are raising the bar on hosted database support.&lt;/p&gt;

&lt;p&gt;For our users on &lt;a href="https://www.timescale.com/timescale-signup" rel="noopener noreferrer"&gt;Timescale Cloud&lt;/a&gt;, we take it one step further. We have a team of support engineers located around the world to help with migrations, data modeling, query or ingest performance, compression settings, and more. &lt;a href="https://www.timescale.com/support" rel="noopener noreferrer"&gt;We offer deep, consultative support for every Timescale Cloud user at no additional charge.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is an investment we’re making in our users, just as our users are making an investment in us by choosing us to host their time-series and relational data. As a comparison, with Amazon RDS, &lt;a href="https://aws.amazon.com/premiumsupport/pricing/" rel="noopener noreferrer"&gt;deeply consultative support in addition to general guidance and best practices starts at over $5,000 per month&lt;/a&gt;, and lower tiers have only a community forum or only offer general advice.&lt;/p&gt;

&lt;p&gt;Here’s what you can expect from working with the Timescale Support Team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Every Case Is Unique
&lt;/h2&gt;

&lt;p&gt;Adopting a new technology (or refining the operations of the one you’re already using) is hard. While TimescaleDB is an extension on top of PostgreSQL, making it much more immediately recognizable and reducing the amount of new information to learn, we still have our own lingo and concepts. &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hypertables/" rel="noopener noreferrer"&gt;Hypertables&lt;/a&gt; and &lt;a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/continuous-aggregates/" rel="noopener noreferrer"&gt;continuous aggregates&lt;/a&gt; and &lt;a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/compression/" rel="noopener noreferrer"&gt;compression&lt;/a&gt;. Oh, my!&lt;/p&gt;
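
&lt;p&gt;For readers new to the lingo, here is a minimal SQL sketch of those three concepts (the table and columns are made up for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Hypertable: a regular table, automatically partitioned by time
SELECT create_hypertable('conditions', 'time');

-- Continuous aggregate: an incrementally maintained materialized view
CREATE MATERIALIZED VIEW conditions_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', time) AS day, device_id, avg(temperature)
FROM conditions
GROUP BY day, device_id;

-- Compression: compress older chunks, segmented by device
ALTER TABLE conditions SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id'
);
SELECT add_compression_policy('conditions', INTERVAL '7 days');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;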

&lt;p&gt;The thing about doing support for a product like TimescaleDB is that there often aren’t easy answers. This is actually a good thing! Our users are talented professionals, so the questions that do have easy answers—basic configurations, API conventions, etc.—are either well documented or intuitive enough that folks have already figured them out for themselves.&lt;/p&gt;

&lt;p&gt;What gets left over are the less clear-cut questions and scenarios: Why did this query perform poorly? What should I use for segmenting in compression based on X, Y, and Z factors? Should I use one hypertable or many hypertables?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Editor's Note:&lt;/strong&gt; One of the most common questions is how to optimize chunk size. (“Chunk” is Timescale lingo for data partitions within a table.) &lt;a href="https://www.timescale.com/blog/timescale-cloud-tips-testing-your-chunk-size/" rel="noopener noreferrer"&gt;We answered this question in this blog post&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These questions have many variables, requiring some back and forth. That is the key to our support here at Timescale: collaboration. We don’t just give rote answers. &lt;strong&gt;Because when it comes to data, there is no one-size-fits-all strategy. We work with you to understand your problem and requirements and devise a solution that works for you in your specific circumstances&lt;/strong&gt;. Our approach is inquisitive and exploratory because we want to answer not only the question you have now but the ones you don’t know you have yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Work With Your Customers, Improve Your Product
&lt;/h2&gt;

&lt;p&gt;We believe strongly in working through questions and answers together. We’re all subject to the unknown unknowns of the &lt;a href="https://en.wikipedia.org/wiki/Johari_window" rel="noopener noreferrer"&gt;Johari window&lt;/a&gt;, so by working together, we collectively increase the things we know and decrease the things we don’t. We understand that when you’re looking for help, the last thing you probably want to get back is a bunch of questions. We ask them to gain a more holistic understanding of the issue at hand and to provide not just a bandaid but a real, lasting solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Through our interactions with individual customers, we improve our product for everyone.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We view every support case as an opportunity to learn, both as individuals and as a company. As a company, we learn from your cases where we can improve: if something isn’t working, that’s something we should look into making better. As individuals, we learn new and innovative ways to look at, analyze, collect, and use data. We learn about operational models and DevOps practices. We learn about new and exciting technologies that work alongside our own and help you answer the questions you need to ask of your data.&lt;/p&gt;

&lt;p&gt;In support, at our core, we are eternally curious. Working with you gives us the opportunity to explore new and different things every day. Along the way, we hope we can both learn something as we work through your questions together. At the end of the day, this interchange of ideas makes us all better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Work With Us!&lt;/strong&gt;&lt;br&gt;
We’d love to work through some of your questions together. If you are a current user of &lt;a href="https://www.timescale.com/timescale-signup" rel="noopener noreferrer"&gt;Timescale Cloud&lt;/a&gt;, you know where we’re at—shoot us a message at &lt;a href="mailto:support@timescale.com"&gt;support@timescale.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you’re not yet a Timescale Cloud user, &lt;a href="https://www.timescale.com/timescale-signup" rel="noopener noreferrer"&gt;you can use it for free for 30 days, no credit card required—and you have full access to our support team during your trial&lt;/a&gt;. We look forward to working with you!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What Does a PostgreSQL Commitfest Manager Do and Should You Become One?</title>
      <dc:creator>kelvinsteve</dc:creator>
      <pubDate>Thu, 25 Aug 2022 17:52:03 +0000</pubDate>
      <link>https://dev.to/tigerdata/what-does-a-postgresql-commitfest-manager-do-and-should-you-become-one-32e9</link>
      <guid>https://dev.to/tigerdata/what-does-a-postgresql-commitfest-manager-do-and-should-you-become-one-32e9</guid>
      <description>&lt;p&gt;I got the chance to volunteer as a PostgreSQL Commitfest Manager (CFM) this past July. As a full-time developer and contributor to PostgreSQL since 2020, I really enjoyed the opportunity to interact with the community more broadly than I usually do. &lt;a href="https://commitfest.postgresql.org/38/" rel="noopener noreferrer"&gt;This particular commitfest was the first of five patch-review-and-commit events for the PostgreSQL 16 development cycle.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Contributing to open-source communities can be incredibly rewarding, but finding a place to begin can be challenging, and there’s often a lot of institutional knowledge hidden away. &lt;a href="https://www.timescale.com/state-of-postgres/2022/" rel="noopener noreferrer"&gt;Looking at the worldwide surveys we held here at Timescale in the last few years&lt;/a&gt;, it seems like this is an issue in many PostgreSQL users’ minds. To shed some light on some common questions, fellow PostgreSQL contributor Aleksander Alekseev showed you &lt;a href="https://www.timescale.com/blog/how-and-why-to-become-a-postgresql-contributor/" rel="noopener noreferrer"&gt;how to contribute your first patch in this article&lt;/a&gt;, while developer advocate Ryan Booz &lt;a href="https://www.timescale.com/blog/state-of-postgresql-how-to-contribute-to-postgresql-and-the-community/" rel="noopener noreferrer"&gt;dug deeper into ways to give back to the community, whether with code or beyond code.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this blog post, I’ll talk more about contributions beyond code by giving you an overview of the CFM position and discussing some of my recommended prerequisites. If this sounds enjoyable, watch this space: I’ll share my July CFM experience, including a timeline, observations, and lessons learned, in a future blog post.&lt;/p&gt;

&lt;p&gt;Note that the contents of this post are my personal opinion only—they don't represent official rules or positions from the community. (For those, you can take a look at the &lt;a href="https://www.postgresql.org/about/policies/coc/" rel="noopener noreferrer"&gt;PostgreSQL Code of Conduct.)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let’s kick off this blog fest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should You Be a PostgreSQL Commitfest Manager?
&lt;/h2&gt;

&lt;p&gt;So who should volunteer to manage a commitfest? Personally, I recommend that potential CFMs should already feel comfortable interacting with the PostgreSQL development community. The following will serve you well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You should know &lt;a href="https://wiki.postgresql.org/wiki/CommitFest" rel="noopener noreferrer"&gt;what a commitfest (CF) is.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;For the commitfest you want to manage, you should know where the community will be in &lt;a href="https://www.postgresql.org/developer/roadmap/" rel="noopener noreferrer"&gt;the development cycle.&lt;/a&gt; A CF at the beginning or end of a PostgreSQL development cycle is going to be more work than one in the middle (and, at the end of the cycle, perhaps more emotionally charged*).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Why "emotionally charged"?
The end of the cycle is the last chance for patch sets to land in the upcoming release. For contributors who've been working on a patch set for a while, it can be disappointing and frustrating to wait another year for a feature they've poured their volunteer time into.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;You will need a &lt;a href="https://www.postgresql.org/account" rel="noopener noreferrer"&gt;PostgreSQL Community account&lt;/a&gt; so that you can log into the &lt;a href="https://commitfest.postgresql.org/" rel="noopener noreferrer"&gt;Commitfest App&lt;/a&gt;, which tracks entries that have been submitted for review.&lt;/li&gt;
&lt;li&gt;You should already be able to send and receive emails on the pgsql-hackers &lt;a href="https://www.postgresql.org/list/pgsql-hackers/" rel="noopener noreferrer"&gt;mailing list.&lt;/a&gt; The CF app will allow you to send mail from the website if you choose, but it's outgoing only; you'll still want a subscribed email address to read others' responses to you. (And honestly, this whole process will be much, much easier if you already have an email workflow set up.)&lt;/li&gt;
&lt;li&gt;You should be aware of, and nominally comfortable with, the &lt;a href="http://cfbot.cputube.org/" rel="noopener noreferrer"&gt;cfbot CI interface&lt;/a&gt;. This is still not integrated into the CF app as of this writing (it’s a highly requested feature!). So, it'll help a lot if you already know how to read the build status of a particular patch set quickly.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It's nice, though not strictly required, if you can download, apply, and test patches from the mailing list. This allows you to perform your own independent review during triage, and if you can do this, you're more likely to understand the overall process from the reviewers' perspective. But reviewing patches is not your primary focus when wearing your CFM hat.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And finally, consider contributing a patch before volunteering to manage a commitfest! (No, it doesn't have to have been committed.) This is mostly to help develop contributor empathy since you'll be interacting with many first-time contributors. In my opinion, it'll help to remember the emotional range (Excitement! Worry! Fear!) that can accompany that first patch. But it's not a hard requirement since, as before, you're not going to be focused on code changes as a CFM.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're missing some of that skillset (or you're worried you might be), can you still volunteer? Absolutely! And you don't have to do it alone—take a buddy. Ask for (or volunteer to be) an assistant CFM so that you can tag-team with someone who can smooth over any gaps in knowledge and answer questions about the process.&lt;/p&gt;

&lt;p&gt;(Like most barriers to OSS contribution, hopefully this is something that we can all help to make easier over time. The fewer the number of ‘gotchas’ in the contribution process, the less a CFM will have to keep in their head, and the easier it will be for newer community members to successfully manage a commitfest.)&lt;/p&gt;

&lt;h2&gt;
  
  
  A Commitfest Manager's Job
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What does a CFM do?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mechanically, you will send emails to many people. There will be plenty of administrative work to ensure that the CF tracker reflects reality in a helpful way. (Maybe you'll have ideas on how to reduce the amount of that work.)&lt;/p&gt;

&lt;p&gt;You'll be answering one-off questions from contributors, like explaining pieces of the process or helping someone find a patch to review. For very new contributors, you may be a first (or simply the most visible) point of contact, and you'll be actively working to keep their patches from falling through the cracks—there's a lot to remember when someone makes their first contribution.&lt;/p&gt;

&lt;p&gt;As with many open-source volunteer positions, everyone brings different goals and skill sets to the role. The common thread seems to be to &lt;strong&gt;keep things flowing:&lt;/strong&gt; help people when they get stuck, remind contributors that they should try to give as much as they receive, and draw community attention to small problems before they get bigger. Some CFMs have focused on an extremely comprehensive evaluation of entries during triage or finding the right people to review patches in certain areas. I tried to bump threads that had gone quiet and needed to be either helped along or returned so that the author wasn't stuck waiting.&lt;/p&gt;

&lt;p&gt;Your job is not to decide which patches get committed and which do not, or even which patch sets are worthy of review and which are not. But you will exercise some independent judgment at the end of the CF, when you’ll recommend how patches are closed.&lt;/p&gt;

&lt;p&gt;Exercise restraint with that judgment. My personal opinion is that, ideally, you want to be finding and channeling a consensus from the community as opposed to asserting your own opinions over patches. That's a really hard goal to meet in practice because you'll have opinions, and you don't want to start finding ways to disguise those opinions as "coming from the community." So just do your best: speak and then listen, be open about why you're taking action, and be honest when you have strong personal opinions.&lt;/p&gt;

&lt;p&gt;And you can absolutely be an author or reviewer while you're managing a commitfest. Try to keep the roles separate, though: make it obvious to new contributors whether you're telling them something as a CFM or as a reviewer, since having a "title" can sometimes imply weight that wasn't intended.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Commitfest Conclusion
&lt;/h2&gt;

&lt;p&gt;It’s easy to look at open-source software from a pure code perspective. But the community is more than just its codebase, and the CFM role is interesting in that most of its responsibilities lie outside the repository. If you’re looking to build (or brush up on) your community participation and communication skills while getting a broad look at the state of PostgreSQL, managing a commitfest for a month might be a good fit for you.&lt;/p&gt;

&lt;p&gt;In the following article, I’ll share my personal experience and cover all the steps in the CFM timeline, from your initial preparation to closing the CF, plus some special cases I encountered along the way. &lt;a href="https://commitfest.postgresql.org/39/" rel="noopener noreferrer"&gt;The 2022 September PostgreSQL commitfest&lt;/a&gt; is starting soon: I hope these posts will help make the process less daunting and encourage others to volunteer and give back to this awesome community.&lt;/p&gt;

&lt;p&gt;Written by Jacob Champion.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How High Availability Works in Timescale Cloud Database</title>
      <dc:creator>kelvinsteve</dc:creator>
      <pubDate>Wed, 10 Aug 2022 14:03:00 +0000</pubDate>
      <link>https://dev.to/tigerdata/how-high-availability-works-in-timescale-cloud-database-166g</link>
      <guid>https://dev.to/tigerdata/how-high-availability-works-in-timescale-cloud-database-166g</guid>
      <description>&lt;p&gt;When our customers first get interested in Timescale Cloud and we mention the importance of high availability, we often hear something along these lines:&lt;/p&gt;

&lt;p&gt;“Now that our workload is growing, hosting our own database is starting to consume too much time. We’re ready to move to a hosted service so we can free up some time for our team. But, we’re unsure about the idea of giving our data to somebody else. What would happen if there’s a failure? How do I know our data is protected?”&lt;/p&gt;

&lt;p&gt;Hosted databases in the cloud are the future, but you should still understand how your hosted database works under the hood to ensure it meets your high availability needs. Database failures used to keep database administrators up at night. That shouldn’t be the case anymore, with hosted database services taking that load off the DBA’s shoulders by keeping databases up and running. Still, vendors can choose to be more transparent: how they keep the service working seamlessly is the kind of information that does not belong in a black box!&lt;/p&gt;

&lt;p&gt;In this blog post, we’re throwing the black box out the window to explain how data availability works in Timescale Cloud within its cloud-native architecture. Even if you are not a Timescale Cloud user (yet), this read may give you a glimpse of how we’ve designed the platform and built a high-availability cloud database using AWS infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Is High Availability?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For a DBA, “availability” means how often you can interact with your database as expected. If your database is available, you can perform normal operations on it, and your end-users (and your business) remain unaffected by any database issues.&lt;/p&gt;

&lt;p&gt;In this context, the term “high availability” (HA) is often used to describe a system in which you can expect minimal downtime. The exact level of downtime you can expect in an HA system depends on your vendor; there’s no universally accepted definition of high availability, although the expected downtime typically ranges from a few seconds to a few minutes.&lt;/p&gt;

&lt;p&gt;Companies often describe their service availability in terms of RTO (Recovery Time Objective) and RPO (Recovery Point Objective). These are fancy terms that actually mean very simple concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The RTO is how long it will take for your service to recover in the event of a failure, usually considering the worst-case scenario.&lt;/li&gt;
&lt;li&gt;The RPO is how much data you could lose if a recovery takes place.&lt;/li&gt;
&lt;/ul&gt;
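
&lt;p&gt;As a back-of-the-envelope illustration, an availability percentage translates directly into an expected yearly downtime budget. The following Python sketch uses example figures, not any vendor’s actual SLA:&lt;/p&gt;

```python
# Convert an availability percentage into expected downtime per year.
# The percentages below are illustrative, not any vendor's actual SLA.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def yearly_downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

for availability in (0.99, 0.999, 0.9999, 0.99999):
    minutes = yearly_downtime_minutes(availability)
    print(f"{availability:.3%} available -> {minutes:8.2f} min/year of downtime")
```

&lt;p&gt;An RTO of a few seconds means nearly all of that budget stays unspent; an RPO of zero means a recovery replays every committed transaction.&lt;/p&gt;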

&lt;p&gt;If you are a DBA, you naturally care a lot about these two numbers. If you’ve managed a database on-premise before, or even in a private cloud, you know that failures do happen. By trusting your operations to a hosted database provider, you want to know how much downtime you may experience, and you want to rest assured that no data loss will occur.&lt;/p&gt;

&lt;p&gt;We’ll eventually answer how this works in Timescale Cloud specifically, but before we get there, let’s spend a few minutes talking about what the infrastructure of a hosted database looks like, focusing on AWS (which is where Timescale Cloud is hosted). Having a mental model of how the underlying infrastructure works will help you better understand which types of failures may happen, and what will be done to get your service back up and running again.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Causes Downtime in a Hosted Database (and Why Choose a Hosted Database at All)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If we look a bit deeper, when we talk about high availability in a database, we are really talking about two different elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first element is your base operations: how reliable the system itself is. This includes how often your infrastructure experiences outages that cause significant disruption to your normal operations. In a hosted database service, this largely depends on the reliability of the underlying physical hardware (for example, the AWS infrastructure).&lt;/li&gt;
&lt;li&gt;The second piece is disaster recovery: how quickly a system can recover when things do go wrong, i.e., when a major problem occurs, how quickly the system can resume normal operations. Disaster recovery doesn’t depend so much on the underlying infrastructure as on how the system itself is engineered (e.g., how we’ve designed Timescale Cloud).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Historically, on-premise was the only option. Databases were always hosted by the companies themselves in their own data centers and operated by their own engineers—this is still the case for many companies.&lt;/p&gt;

&lt;p&gt;For these self-hosted teams, base operations are a crucial element of keeping their database up: a task that requires specialized skills in hardware management and database administration. For example, they would be in charge of evaluating different compute and storage servers, purchasing them, and setting them up; the database would also need to be installed and properly configured (and eventually updated); the system would need to be operated and maintained; a set of operating rules would need to be put in place to determine what to do when storage corrupts or compute fails; backups would need to be maintained and tested…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknfmvpyzmyc1l7t5uazc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknfmvpyzmyc1l7t5uazc.png" alt="beaver" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When self-hosting their database, &lt;a href="https://www.datacenterdynamics.com/en/news/beaver-causes-internet-outage-in-a-uniquely-canadian-turn-of-events/" rel="noopener noreferrer"&gt;engineering teams need to be prepared for any sort of event that may cause a hardware outage, including beavers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is a lot of work. Managing their own database is certainly possible for some engineering teams—but others may prefer to focus their efforts on building their application instead of spending them on database maintenance and operations. These teams often choose to use a hosted database service like Timescale Cloud.&lt;/p&gt;

&lt;p&gt;Timescale Cloud, like many hosted database services, runs in AWS. This means that AWS handles the management and reliability of the underlying hardware—and they’re very good at it. By choosing a database hosted in AWS, we can forget about the physical maintenance of our infrastructure. This delegates the first element of availability, related to the maintenance of base operations, to AWS.&lt;/p&gt;

&lt;p&gt;To understand what this actually means in the case of Timescale Cloud, it’s worth doing a quick overview of the AWS components that are actually being used to host the database:&lt;/p&gt;

&lt;p&gt;The compute piece of the hardware is covered by &lt;a href="https://aws.amazon.com/ec2/" rel="noopener noreferrer"&gt;EC2&lt;/a&gt; instances. In Timescale Cloud, the compute runs the PostgreSQL server, handles the connections to the database, and holds the local memory. (EC2 stands for Elastic Compute Cloud.)&lt;/p&gt;

&lt;p&gt;The database storage piece is covered by &lt;a href="https://aws.amazon.com/ebs/" rel="noopener noreferrer"&gt;EBS&lt;/a&gt; volumes. This is where the data actually lives—your disk storage and file system. (EBS stands for Elastic Block Store).&lt;/p&gt;

&lt;p&gt;The backups and WAL are stored in &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;S3&lt;/a&gt; as the long-term storage element. These are things that don’t have to be accessed regularly as part of the daily database operations and thus benefit from being stored in S3, which is a bit slower to access than EBS but &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html" rel="noopener noreferrer"&gt;incredibly reliable&lt;/a&gt;. (S3 stands for Simple Storage Service.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dexy98wbatu0lwvnwv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dexy98wbatu0lwvnwv1.png" alt="AWS components as the Timescale Cloud infrastructure" width="800" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Timescale Cloud was designed as a cloud-native platform from the start. We rely on AWS for our underlying hardware infrastructure and have built automated detection and recovery for scenarios when a piece of hardware fails, such as an EC2 instance. Though our underlying availability can only be as good as the hardware it is built upon, we can do some engineering magic on top of this hardware to cover those situations, minimizing the impact on our users—we discuss this magic later in this post.&lt;/p&gt;

&lt;p&gt;An important consequence of our cloud-native approach is that the compute and storage pieces are not tied together in Timescale Cloud, unlike in a traditional server setup. This allows us to offer some nice benefits to our users. For example, as the end-user of Timescale Cloud, &lt;a href="https://docs.timescale.com/cloud/latest/service-operations/resources/" rel="noopener noreferrer"&gt;you’re able to scale your compute and storage up and down independently&lt;/a&gt;, which is very convenient and cost-efficient. But a decoupled compute and storage architecture has benefits beyond cost-efficiency: as we’re about to see, it also increases availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How Timescale Cloud Handles Compute Failures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We said before that AWS does a great job in keeping their infrastructure up and running. But how good?&lt;/p&gt;

&lt;p&gt;In the figure below, you can see the levels of availability that AWS defines for each of the components that make up a Timescale Cloud service (EC2, EBS, S3). These availability levels are all very high, but they are not 100%. Hardware failures will happen sometimes, even to AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yeu199k6e6kobq9ijsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yeu199k6e6kobq9ijsg.png" alt="Levels of availability of different infrastructure components in AWS" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These numbers are relevant for ensuring high availability in Timescale Cloud. If you pay attention to the numbers above, EC2 (the compute piece) will fail significantly more often than the storage. Statistically speaking, roughly 9 out of 10 hardware failures you experience in a hosted service will be due to a compute failure.&lt;/p&gt;
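
&lt;p&gt;A quick way to see where the “9 out of 10” intuition comes from: if the compute layer is roughly ten times more likely to be unavailable than the storage layer, almost all observed hardware failures will be compute failures. The rates in this Python sketch are illustrative, not AWS’s published figures:&lt;/p&gt;

```python
# Share of hardware failures attributable to each component, assuming
# illustrative unavailability rates (not AWS's published figures).

unavailability = {
    "EC2 (compute)": 1e-4,   # e.g., ~99.99% available
    "EBS (storage)": 1e-5,   # e.g., ~99.999% available
}

total = sum(unavailability.values())
for component, rate in unavailability.items():
    print(f"{component}: ~{rate / total:.0%} of expected hardware failures")
```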

&lt;p&gt;So what happens to your Timescale Cloud service if the underlying EC2 instance that’s hosting your database compute fails?&lt;/p&gt;

&lt;p&gt;This is when the decoupled compute-storage architecture of Timescale Cloud comes in extremely handy. In a traditional database setup on-premise, you would always need to do a recovery from backup, even in the case of a compute failure—and as we’ll see later in the post, recovering from backups can be a lengthy process. This means that even a compute failure would cause significant downtime to your end users.&lt;/p&gt;

&lt;p&gt;But since the compute and storage nodes are decoupled in Timescale Cloud, if the compute fails, we can automatically spin up a new compute node and attach your undamaged storage unit to it. This recovery process takes only seconds in the majority of cases, and it’s done without any action needed from you. The only thing you will notice is a reset of your database connections.&lt;/p&gt;
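
&lt;p&gt;The recovery path can be pictured with a minimal sketch. The class and function names below are invented for illustration; this is a conceptual model, not Timescale’s actual implementation:&lt;/p&gt;

```python
# Conceptual model of decoupled compute/storage recovery.
# All names are invented for illustration; not Timescale's real code.

class Storage:
    """Durable storage (EBS-like): it survives compute failures."""
    def __init__(self):
        self.data = []

class Compute:
    """Compute node (EC2-like): runs the database server."""
    def __init__(self, storage):
        self.storage = storage
        self.alive = True

def recover_from_compute_failure(failed):
    # Spin up a fresh compute node and attach the undamaged storage.
    # No restore from backup is needed: the data never moved.
    return Compute(failed.storage)

node = Compute(Storage())
node.storage.data.append("metric: cpu=0.42")
node.alive = False                          # the compute instance fails
node = recover_from_compute_failure(node)
print(node.alive, node.storage.data)        # True ['metric: cpu=0.42']
```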

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov3ohshgs0l1jvnv36li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov3ohshgs0l1jvnv36li.png" alt="rapid recovery" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What If There’s a Failure Affecting Your Storage?
&lt;/h2&gt;

&lt;p&gt;As we saw earlier, AWS is very good at managing hardware. Failures affecting the storage side of things (EBS in the case of Timescale Cloud) are way less common—and yet they happen from time to time. How will your managed database service handle recovery in this case?&lt;/p&gt;

&lt;h2&gt;
  
  
  Reducing Downtime as Much as Possible: Replication
&lt;/h2&gt;

&lt;p&gt;A first failover scenario involves the use of replicas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.timescale.com/blog/high-availability-for-your-production-environments-introducing-database-replication-in-timescale-cloud/" rel="noopener noreferrer"&gt;In Timescale Cloud, users can enable a replica in one click when they create their service (or anytime after the fact)&lt;/a&gt;. This replica will stay in sync with the primary database at all times, containing the exact same information and configuration. If something occurs that makes the data stored in the primary database unavailable, the platform will automatically switch all operations to the replica, which contains an up-to-date copy of your data. This process takes only a few seconds (&amp;lt;10s), which is the only downtime that your end-users will experience. Often the only thing noticeable is a reset of connections to the database.&lt;/p&gt;

&lt;p&gt;This will effectively fix the problem for you and your end-users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fis1dc68dbcbdd9zbfedz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fis1dc68dbcbdd9zbfedz.png" alt="node roles" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a normal operating state, the application is connected to the primary and optionally to its replica. The load balancer handles the connection and defines the role for each node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0wyht0u4qgk33b72bsh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0wyht0u4qgk33b72bsh.png" alt="failed node recovery" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the primary database fails, the platform updates the roles, “promoting” the replica to the primary role, with the primary load balancer redirecting traffic to the new primary. In the meantime, the system begins the recovery of the failed node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqo6x3dqd3spgkw7dt5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqo6x3dqd3spgkw7dt5a.png" alt="replicas" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the failover process has been completed, Timescale Cloud will proceed to repair the damaged node, which will eventually become the new replica.&lt;/p&gt;
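
&lt;p&gt;The role swap during failover can be summarized in a few lines. Again, this is a conceptual sketch with invented names, not the platform’s actual logic:&lt;/p&gt;

```python
# Conceptual sketch of replica promotion during failover.
# Node names and roles are invented for illustration.

def fail_over(roles, failed_node):
    """Promote the replica to primary; mark the failed node for repair."""
    new_roles = {}
    for node, role in roles.items():
        if node == failed_node:
            new_roles[node] = "recovering"   # repaired, then becomes the replica
        elif role == "replica":
            new_roles[node] = "primary"      # promotion
        else:
            new_roles[node] = role
    return new_roles

roles = {"node-a": "primary", "node-b": "replica"}
roles = fail_over(roles, "node-a")
print(roles)  # {'node-a': 'recovering', 'node-b': 'primary'}
```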

&lt;p&gt;We always recommend that our users enable replication for mission-critical workloads, as it significantly increases the availability of their service. If your system requires uptime guarantees, replicas are the option for you.&lt;/p&gt;

&lt;p&gt;Also, replicas in Timescale Cloud are automatically created in a different Availability Zone (AZ) than your primary database for extra peace of mind. AWS hosts its infrastructure in different regions across the globe (e.g., us-east-1). For extra safety, each region is divided into multiple availability zones, which remain isolated from each other (e.g., power may go down in one AZ without affecting the others within the same region). Having your replica and your primary database hosted in different AZs gives you extra redundancy in case an entire AZ goes down.&lt;/p&gt;

&lt;h2&gt;
  
  
  The last resort: Backups
&lt;/h2&gt;

&lt;p&gt;In Timescale Cloud, replicas are strongly recommended—but not enabled by default (as they increase the cost of your service). But we (of course) ensure data protection for all our services, not only those with a replica. If you don’t have a replica enabled and there’s a failure affecting your storage, your good old friend, the backup, will come to the rescue.&lt;/p&gt;

&lt;p&gt;If you’ve ever dealt with databases on-premise or in your own cloud, you are already familiar with backups. By backing up your database at regular intervals, you can restore from the latest backup if there’s a failure affecting your database, which essentially means getting your data into a new database.&lt;/p&gt;

&lt;p&gt;Backups are the historic way of dealing with database failures, but recovery from backups can be a rather slow process that is limited by the quality and frequency of the latest backups. If the latest backup was two days ago, then the last two days of data might be lost!&lt;/p&gt;

&lt;p&gt;Even though this problem is mostly solved today by tools like pgBackRest, configuring the backup strategy, testing backups, and automating the recovery is a time-intensive process… And it can be rather stressful.&lt;/p&gt;

&lt;p&gt;In a database with cloud-native infrastructure like Timescale Cloud, backups are our safety net but not our only resource. As we explained earlier, having a cloud-native infrastructure allows us to fix compute failures without touching our backups—and for mission-critical applications, we always recommend enabling replication, so you can be protected against the potentially longer downtime caused by more severe failures.&lt;/p&gt;

&lt;p&gt;But not all workloads are mission-critical. Perhaps you have certain services that power internal dashboards or machine learning models, or that host historical data you use to build weekly reports—for systems like these, you may decide that a little downtime is not critical, and you may choose not to enable a replica. If one of these services experiences a failure affecting its storage, how does Timescale Cloud recover your data?&lt;/p&gt;

&lt;p&gt;First, Timescale Cloud keeps up-to-date backups of all services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full backups are taken weekly. This process is done automatically by the platform (you don’t need to do anything manually) and for all services.&lt;/li&gt;
&lt;li&gt;Incremental backups are taken daily. These backups record the changes made since the last full backup.&lt;/li&gt;
&lt;li&gt;On top of these full and daily backups, Timescale Cloud keeps WAL (write-ahead log) files of any changes made to the database. This WAL can be replayed in the event of a failure to reproduce any transactions not captured by the last daily backup, e.g., to replay the changes made to your database during the last few hours.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By combining these three elements, we can do a point-in-time recovery—we can recover a database to any point in time, and you won’t experience any data loss.&lt;/p&gt;
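
&lt;p&gt;Conceptually, the recovery stitches the three layers together in order: restore the last full backup, apply the incrementals, then replay the WAL up to the target time. A toy model (not the actual backup machinery):&lt;/p&gt;

```python
# Toy model of point-in-time recovery: full backup + incrementals + WAL.
# Data and timestamps are invented for illustration.

full_backup  = {"t": 0, "rows": ["r1", "r2"]}            # weekly full backup
incrementals = [{"t": 3, "rows": ["r3"]},                # daily deltas
                {"t": 4, "rows": ["r4"]}]
wal          = [(5, "r5"), (6, "r6"), (7, "r7")]         # per-transaction log

def recover(target_time):
    """Rebuild the database state as of target_time."""
    rows = list(full_backup["rows"])
    for inc in incrementals:                 # apply incremental backups
        if inc["t"] <= target_time:
            rows += inc["rows"]
    for t, row in wal:                       # replay the WAL tail
        if t <= target_time:
            rows.append(row)
    return rows

print(recover(6))  # ['r1', 'r2', 'r3', 'r4', 'r5', 'r6']
```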

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl69i6dgmbdumvvrtec9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl69i6dgmbdumvvrtec9t.png" alt="process of backup and recovery in Timescale Cloud." width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The figure above illustrates the process of recovery from backup. How long does this take? As we’ve mentioned before, this is the longest recovery process—you will experience substantially more downtime than if you have replication enabled. The exact amount of downtime, however, will depend on multiple factors, including how up-to-date your backups are, how much data you have in general, and your compute size (how much CPU/memory you have available).&lt;/p&gt;

&lt;h2&gt;
  
  
  High Availability in the Cloud: A Summary
&lt;/h2&gt;

&lt;p&gt;We hope you now have a better view of what the infrastructure behind a hosted database service really looks like, and of the different strategies you can follow to achieve as much availability as possible.&lt;/p&gt;

&lt;p&gt;In the particular case of Timescale Cloud, the platform protects your data against failure automatically, with very low RTO (Recovery Time Objective) and RPO (Recovery Point Objective) for all instances. For mission-critical workloads with high-availability requirements, Timescale Cloud also offers replicas—which ensure near-zero downtime and near-zero data loss if the database fails. &lt;a href="https://docs.timescale.com/cloud/latest/service-operations/replicas/#create-a-database-replica" rel="noopener noreferrer"&gt;Click here to learn how to enable a replica in your Timescale Cloud service.&lt;/a&gt; If you still haven’t tried Timescale Cloud, you can create an account &lt;a href="https://www.timescale.com/timescale-signup" rel="noopener noreferrer"&gt;here.&lt;/a&gt; You will get free access to the platform for 30 days, no credit card required.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Timescale Releases Third State of PostgreSQL Report</title>
      <dc:creator>kelvinsteve</dc:creator>
      <pubDate>Tue, 02 Aug 2022 15:15:00 +0000</pubDate>
      <link>https://dev.to/tigerdata/timescale-releases-third-state-of-postgresql-report-12ne</link>
      <guid>https://dev.to/tigerdata/timescale-releases-third-state-of-postgresql-report-12ne</guid>
<description>&lt;p&gt;&lt;a href="https://www.timescale.com/" rel="noopener noreferrer"&gt;Timescale&lt;/a&gt; is proud to announce the results of its 2022 State of PostgreSQL survey. Timescale is the maker of TimescaleDB, the open-source relational database for time-series data and one of the most popular PostgreSQL extensions. This year, nearly 1,000 respondents participated in the four-week survey, providing developers with unique insight into the PostgreSQL market.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.timescale.com/state-of-postgres/2022" rel="noopener noreferrer"&gt;View the full report here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafiy3swb6ywhgyvp11o8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafiy3swb6ywhgyvp11o8.png" alt="State of Postgres Image with big caption and elephant image" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The State of PostgreSQL survey provides significant insight into PostgreSQL features and the broader PostgreSQL community. In 2019, the &lt;a href="https://www.timescale.com/blog/what-is-the-state-of-postgresql/" rel="noopener noreferrer"&gt;first edition of the report&lt;/a&gt; was released with feedback collated from over 500 developers. In 2021, the &lt;a href="https://www.timescale.com/blog/2021-state-of-postgres-survey-results/" rel="noopener noreferrer"&gt;second edition of the report&lt;/a&gt;, sampling close to 500 participants, was released. In both years, respondents were mainly software developers/engineers, software architects, and database administrators, with the EMEA (Europe, Middle East, Africa) region accounting for 65% of participation, North America for 25.9%, and South America for 7%.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“By organizing and delivering the State of PostgreSQL report, we help developers and developer-centric companies and communities better understand what is happening with Postgres today: the different types of Postgres users, the kinds of use cases they are tackling, the places they go to share and learn, how that is all changing, and the opportunities for improvement in the Postgres community overall,” said Ajay Kulkarni, Chief Executive Officer. “This also gives us an opportunity to give back to the broader PostgreSQL community, a community we are proud to be a part of and that has been very helpful to us.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This year, as participation in the survey increased, so did the popularity of PostgreSQL. Key findings from this year’s report include why respondents use PostgreSQL, how they contribute to the community, adoption across organizations, and favorite tools and extensions.&lt;/p&gt;

&lt;p&gt;Other highlights include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Over three-quarters of respondents report using PostgreSQL for personal projects. 95% of all respondents use PostgreSQL at work, and 73.5% use Postgres for both personal and professional projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why PostgreSQL?&lt;/strong&gt;&lt;br&gt;
Open source ranked as the #1 reason people choose PostgreSQL (19.3%), followed by reliability (16.5%) and extensions (9.9%). This shows that open source continues to be a high-value proposition for most users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contributions&lt;/strong&gt;&lt;br&gt;
44% of PostgreSQL users with 15+ years of experience have contributed to PostgreSQL at least once. Among those with less than 15 years of experience, only 12% have contributed to PostgreSQL. This percentage represents a great opportunity for the PostgreSQL community to welcome new people and ideas into the fold. (For more information, read &lt;a href="https://www.timescale.com/blog/how-and-why-to-become-a-postgresql-contributor/" rel="noopener noreferrer"&gt;our guide to contributing to PostgreSQL&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ways Of Learning&lt;/strong&gt;&lt;br&gt;
The majority of respondents (76.2%) identified technical documentation as their preferred way of learning about PostgreSQL, followed by long-form blog posts (51.5%) and short-form blog posts (43.3%). Respondents with less than five years of PostgreSQL experience prefer video over long-form and short-form blog posts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extensions&lt;/strong&gt;&lt;br&gt;
One of the most powerful aspects of PostgreSQL is its rich extensibility model. Developers mix and match several purpose-built extensions to help them build more compelling applications. This year’s list of most favored extensions is stable compared to last year:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://postgis.net/" rel="noopener noreferrer"&gt;PostGIS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.timescale.com/timescaledb/latest/overview/" rel="noopener noreferrer"&gt;TimescaleDB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/current/pgstatstatements.html" rel="noopener noreferrer"&gt;pg_stat_statements&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www-dev.timescale.com.s3-website-us-east-1.amazonaws.com/feature-state-of-postgres-2022/state-of-postgres/2022#:~:text=04-,pgcrypto,-05" rel="noopener noreferrer"&gt;pgcrypto&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/current/pgtrgm.html" rel="noopener noreferrer"&gt;pg_trgm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more information, see our blog post about &lt;a href="https://www.timescale.com/blog/top-5-postgresql-extensions/" rel="noopener noreferrer"&gt;the top PostgreSQL extensions.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt;&lt;br&gt;
Of the respondents who use tools to connect PostgreSQL for queries and administration tasks, psql (69.4%), pgAdmin (35.3%), and DBeaver (26.2%) were the top three choices. For more information, see &lt;a href="https://www.timescale.com/blog/state-of-postgresql-2022-13-tools-that-arent-psql/" rel="noopener noreferrer"&gt;our blog post on PostgreSQL tools, in addition to psql.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;br&gt;
Compared to 2019 and 2021, fewer respondents reported self-managing their PostgreSQL database. It appears that PostgreSQL users are increasingly using DBaaS providers for deploying PostgreSQL. Of those who deploy PostgreSQL as a Kubernetes container, 44% use Helm, 16% use Crunchy Operator, and 7% use Zalando Operator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About Timescale&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Timescale is the creator of TimescaleDB, the industry-leading database for time-series data. Tens of thousands of organizations trust TimescaleDB today with their mission-critical time-series applications. The company is dedicated to serving software developers and businesses worldwide, enabling them to build the next wave of computing. Timescale is a remote-first company with a global workforce and is backed by Tiger Global, Benchmark Capital, New Enterprise Associates, Redpoint Ventures, Icon Ventures, Two Sigma Ventures, and other leading investors. For more information, visit &lt;a href="http://www.timescale.com" rel="noopener noreferrer"&gt;www.timescale.com&lt;/a&gt; or follow &lt;a href="https://twitter.com/timescaledb" rel="noopener noreferrer"&gt;@TimescaleDB&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Contacts&lt;br&gt;
PRESS CONTACT:&lt;br&gt;
Kelvin Steve&lt;br&gt;
Senior Communications Manager&lt;br&gt;
&lt;a href="mailto:press@timescale.com"&gt;press@timescale.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>How We Made Data Aggregation Better and Faster on PostgreSQL With TimescaleDB 2.7</title>
      <dc:creator>kelvinsteve</dc:creator>
      <pubDate>Tue, 21 Jun 2022 14:15:28 +0000</pubDate>
      <link>https://dev.to/tigerdata/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-27-30mo</link>
      <guid>https://dev.to/tigerdata/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-27-30mo</guid>
      <description>&lt;p&gt;Time-series data is the lifeblood of the analytics revolution in nearly every industry today. One of the most difficult challenges for application developers and data scientists is aggregating data efficiently without always having to query billions (or trillions) of raw data rows. Over the years, developers and databases have created numerous ways to solve this problem, usually similar to one of the following options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DIY processes to pre-aggregate data and store it in regular tables.&lt;/strong&gt; Although this provides a lot of flexibility, particularly with indexing and data retention, it's cumbersome to develop and maintain, particularly when deciding how to track and update aggregates for data that arrives late or is updated after the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extract, Transform, and Load (ETL) processes for longer-term analytics.&lt;/strong&gt; Even today, organizations employ entire teams that specifically manage ETL processes for databases and applications because of the constant overhead of creating and maintaining the perfect process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Materialized views.&lt;/strong&gt; While these views are flexible and easy to create, they are static snapshots of the aggregated data. Unfortunately, in all current implementations, developers need to manage updates using triggers or cron-like applications. And in all but a very few databases, every refresh recomputes and replaces all historical data, preventing developers from dropping older raw data to save space and computation resources.&lt;/p&gt;

&lt;p&gt;Most developers head down one of these paths because we learn, often the hard way, that running reports and analytic queries over the same raw data, request after request, doesn't perform well under heavy load. In truth, most raw time-series data doesn't change after it's been saved, so these complex aggregate calculations return the same results each time.&lt;/p&gt;

&lt;p&gt;In fact, as a long-term time-series database developer, I've used all of these methods too, so that I could manage historical aggregate data to make reporting, dashboards, and analytics faster and more valuable, even under heavy usage.&lt;/p&gt;

&lt;p&gt;I loved when customers were happy, even if it meant a significant amount of work behind the scenes maintaining that data.&lt;/p&gt;

&lt;p&gt;But, I always wished for a more straightforward solution.&lt;/p&gt;


&lt;h2&gt;
  
  
  How TimescaleDB Improves Queries on Aggregated Data in PostgreSQL
&lt;/h2&gt;


&lt;p&gt;In 2019, TimescaleDB introduced continuous aggregates to solve this very problem, making the ongoing aggregation of massive time-series data easy and flexible. This is the feature that first caught my attention as a PostgreSQL developer looking to build more scalable time-series applications—precisely because I had been doing it the hard way for so long.&lt;/p&gt;

&lt;p&gt;Continuous aggregates look and act like materialized views in PostgreSQL, but with many of the additional features I was looking for. These are just some of the things they do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatically track changes and additions to the underlying raw data.&lt;/li&gt;
&lt;li&gt;Provide configurable, user-defined policies to keep the materialized data up-to-date automatically.&lt;/li&gt;
&lt;li&gt;Automatically append new data (as real-time aggregates by default) before the scheduled process has materialized to disk. This setting is configurable.&lt;/li&gt;
&lt;li&gt;Retain historical aggregated data even if the underlying raw data is dropped.&lt;/li&gt;
&lt;li&gt;Can be compressed to reduce storage needs and further improve the performance of analytic queries.&lt;/li&gt;
&lt;li&gt;Keep dashboards and reports running smoothly.&lt;/li&gt;
&lt;/ul&gt;
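&lt;p&gt;As an illustration of the policy-driven refresh in that list, here is a minimal sketch using TimescaleDB's policy API (the view name is hypothetical, and the intervals are examples only):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Keep the last month of buckets up to date, refreshing hourly;
-- the most recent hour is appended in real time from raw data
SELECT add_continuous_aggregate_policy('hourly_stats',
    start_offset      =&amp;gt; INTERVAL '1 month',
    end_offset        =&amp;gt; INTERVAL '1 hour',
    schedule_interval =&amp;gt; INTERVAL '1 hour');
&lt;/code&gt;&lt;/pre&gt;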

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw9qkerxxfy3d08ec5kv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw9qkerxxfy3d08ec5kv.png" alt=" " width="800" height="918"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What About Other Databases?
&lt;/h2&gt;


&lt;p&gt;By now, some readers might be thinking something along these lines:&lt;/p&gt;

&lt;p&gt;“Continuous aggregates may help with the management and analytics of time-series data in PostgreSQL, but that’s what NoSQL databases are for—they already provide the features you needed from the get-go. Why didn’t you try a NoSQL database?”&lt;/p&gt;

&lt;p&gt;Well, I did.&lt;/p&gt;

&lt;p&gt;There are numerous time-series and NoSQL databases on the market that attempt to solve this specific problem. I looked at (and used) many of them. But from my experience, nothing can quite match the advantages of a relational database with a feature like continuous aggregates for time-series data. These other options provide a lot of features for a myriad of use cases, but they weren't the right solution for this particular problem, among other things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What about MongoDB?&lt;/strong&gt;&lt;br&gt;
MongoDB has been the go-to for many data-intensive applications. Since version 4.2, it has included a feature called On-Demand Materialized Views. On the surface, it works similarly to a materialized view by combining the Aggregation Pipeline feature with a $merge operation to mimic ongoing updates to an aggregate data collection. However, there is no built-in automation for this process, and MongoDB doesn't keep track of any modifications to underlying data. The developer is still required to keep track of which time frames to materialize and how far back to look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What about InfluxDB?&lt;/strong&gt;&lt;br&gt;
For many years InfluxDB has been the destination for time-series applications. Although we've discussed in other articles how InfluxDB doesn't scale effectively, particularly with high cardinality datasets, it does provide a feature called Continuous Queries. This feature is also similar to a materialized view and goes one step further than MongoDB by automatically keeping the dataset updated. Unfortunately, it suffers from the same lack of raw data monitoring and doesn't provide nearly as much flexibility as SQL in how the datasets are created and stored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What about ClickHouse?&lt;/strong&gt;&lt;br&gt;
ClickHouse, and several recent forks like Firebolt, have redefined the way some analytic workloads perform. Alongside its impressive query performance, it also provides a mechanism similar to a materialized view, backed by an AggregatingMergeTree engine. In a sense, this provides almost real-time aggregated data because all inserts are saved to both the regular table and the materialized view. The biggest downside of this approach is dealing with updates or modifying the timing of the process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recent Improvements in Continuous Aggregates: Meet TimescaleDB 2.7
&lt;/h2&gt;

&lt;p&gt;Continuous aggregates were first introduced in TimescaleDB 1.3 solving the problems that many PostgreSQL users, including me, faced with time-series data and materialized views: automatic updates, real-time results, easy data management, and the option of using the view for downsampling.&lt;/p&gt;

&lt;p&gt;But continuous aggregates have come a long way. One of the previous improvements was the introduction of compression for continuous aggregates in TimescaleDB 2.6. Now, we took it a step further with the arrival of TimescaleDB 2.7, which introduces dramatic performance improvements in continuous aggregates. &lt;strong&gt;They are now blazing fast—up to 44,000x faster in some queries than in previous versions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me give you one concrete example: &lt;strong&gt;in initial testing using live, real-time stock trade transaction data, typical candlestick aggregates were nearly 2,800x faster to query&lt;/strong&gt; than in previous versions of continuous aggregates (which were already fast!)&lt;/p&gt;

&lt;p&gt;Later in this post, we will dig into the performance and storage improvements introduced by TimescaleDB 2.7 by presenting a complete benchmark of continuous aggregates using multiple datasets and queries. 🔥&lt;/p&gt;

&lt;p&gt;But the improvements don’t end here.&lt;/p&gt;

&lt;p&gt;First, the new continuous aggregates also require 60% less storage (on average) than before for many common aggregates, which directly translates into storage savings. Second, in previous versions of TimescaleDB, continuous aggregates came with certain limitations: users, for example, could not use certain functions like DISTINCT, FILTER, or ORDER BY. These limitations are now gone. TimescaleDB 2.7 ships with a completely redesigned materialization process that solves many of the previous usability issues, so you can use any aggregate function to define your continuous aggregate. &lt;a href="https://docs.timescale.com/timescaledb/latest/overview/release-notes/" rel="noopener noreferrer"&gt;Check out our release notes for all the details on what's new.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And now, the fun part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Show Me the Numbers: Benchmarking Aggregate Queries
&lt;/h2&gt;

&lt;p&gt;To test the new version of continuous aggregates, we chose two datasets that represent common time-series datasets: IoT and financial analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IoT dataset (~1.7 billion rows)&lt;/strong&gt;: The IoT data we leveraged is the New York City Taxicab dataset that's been maintained by Todd Schneider for a number of years, and scripts are available in his GitHub repository to load data into PostgreSQL. Unfortunately, a week after his latest update, the transit authority that maintains the actual datasets changed their long-standing export data format from CSV to Parquet—which means the current scripts will not work. Therefore, the dataset we tested with is from data prior to that change and covers ride information from 2014 to 2021.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stock transactions dataset (~23.7 million rows)&lt;/strong&gt;: The financial dataset we used is a real-time stock trade dataset provided by Twelve Data and ingests ongoing transactions for the top 100 stocks by volume from February 2022 until now. Real-time transaction data is typically the source of many stock trading analysis applications requiring aggregate rollups over intervals for visualizations like candlestick charts and machine learning analysis. While our example dataset is smaller than a full-fledged financial application would maintain, it provides a working example of ongoing data ingestion using continuous aggregates, TimescaleDB native compression, and automated raw data retention (while keeping aggregate data for long-term analysis).&lt;br&gt;
You can use a sample of this data, generously provided by Twelve Data, to try all of the improvements in TimescaleDB 2.7 by following this tutorial, which provides stock trade data for the last 30 days. Once you have the database set up, you can take it a step further by registering for an API key and following our tutorial to ingest ongoing transactions from the Twelve Data API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating Continuous Aggregates Using Standard PostgreSQL Aggregate Functions
&lt;/h2&gt;

&lt;p&gt;The first thing we benchmarked was to create an aggregate query that used standard PostgreSQL aggregate functions like MIN(), MAX(), and AVG(). In each dataset we tested, we created the same continuous aggregate in TimescaleDB 2.6.1 and 2.7, ensuring that both aggregates had computed and stored the same number of rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IoT dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This continuous aggregate resulted in 1,760,000 rows of aggregated data spanning seven years of data.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE MATERIALIZED VIEW hourly_trip_stats
WITH (timescaledb.continuous, timescaledb.finalized=false)
AS
SELECT
    time_bucket('1 hour',pickup_datetime) bucket,
    avg(fare_amount) avg_fare,
    min(fare_amount) min_fare,
    max(fare_amount) max_fare,
    avg(trip_distance) avg_distance,
    min(trip_distance) min_distance,
    max(trip_distance) max_distance,
    avg(congestion_surcharge) avg_surcharge,
    min(congestion_surcharge) min_surcharge,
    max(congestion_surcharge) max_surcharge,
    cab_type_id,
    passenger_count
FROM
    trips
GROUP BY
    bucket, cab_type_id, passenger_count;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Stock transactions dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This continuous aggregate resulted in 950,000 rows of data at the time of testing, although these are updated as new data comes in.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE MATERIALIZED VIEW five_min_candle_delta
WITH (timescaledb.continuous) AS
    SELECT
        time_bucket('5 minute', time) AS bucket,
        symbol,
        FIRST(price, time) AS "open",
        MAX(price) AS high,
        MIN(price) AS low,
        LAST(price, time) AS "close",
        MAX(day_volume) AS day_volume,
        (LAST(price, time)-FIRST(price, time))/FIRST(price, time) AS change_pct
    FROM stocks_real_time srt
    GROUP BY bucket, symbol;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To test the performance of these two continuous aggregates, we selected the following queries, all common queries among our users for both the IoT and financial use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SELECT COUNT(*)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SELECT COUNT(*)&lt;/code&gt; with &lt;code&gt;WHERE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ORDER BY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time_bucket&lt;/code&gt; reaggregation&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FILTER&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;HAVING&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s take a look at the results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query #1: &lt;code&gt;SELECT COUNT(*) FROM…&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Doing a &lt;code&gt;COUNT(*)&lt;/code&gt; from PostgreSQL is a known performance bottleneck. It's one of the reasons we created the &lt;code&gt;approximate_row_count()&lt;/code&gt; function in TimescaleDB, which uses table statistics to provide a close approximation of the overall row count. However, it's instinctual for most users (and ourselves, if we're honest) to try to get a quick row count by doing a &lt;code&gt;COUNT(*)&lt;/code&gt; query:&lt;/p&gt;
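&lt;p&gt;For reference, the statistics-based alternative mentioned above looks like this (a sketch, run against a raw hypertable such as the &lt;code&gt;trips&lt;/code&gt; table from our IoT dataset):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Near-instant estimate from catalog statistics; no full scan required
SELECT approximate_row_count('trips');
&lt;/code&gt;&lt;/pre&gt;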

&lt;pre&gt;&lt;code&gt;-- IoT dataset
SELECT count(*) FROM hourly_trip_stats;

-- Stock transactions dataset
SELECT count(*) FROM five_min_candle_delta;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And most users noticed that in previous versions of TimescaleDB, running a COUNT over the materialized data was slower than expected.&lt;/p&gt;

&lt;p&gt;Thinking about our two example datasets, both continuous aggregates reduce the overall row count from raw data by 20x or more. So, while counting rows in PostgreSQL is slow, it always felt a little slower than it had to be. The reason was that not only did PostgreSQL have to scan and count all of the rows of data, it had to group the data a second time because of some additional data that TimescaleDB stored as part of the original design of continuous aggregates. With the new design of continuous aggregates in TimescaleDB 2.7, that second grouping is no longer required, and PostgreSQL can just query the data normally, translating into faster queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3eywqfb7zy6ht4laz88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3eywqfb7zy6ht4laz88.png" alt=" " width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Performance of a query with SELECT COUNT (*) in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Query #2: SELECT COUNT(*) Based on The Value of a Column
&lt;/h2&gt;

&lt;p&gt;Another common query that many analytic applications perform is to count the number of records where the aggregate value is within a certain range:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- IoT dataset
SELECT count(*) FROM hourly_trip_stats
WHERE avg_fare &amp;gt; 13.1
AND bucket &amp;gt; '2018-01-01' AND bucket &amp;lt; '2019-01-01';

-- Stock transactions dataset
SELECT count(*) FROM five_min_candle_delta
WHERE change_pct &amp;gt; 0.02;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In previous versions of continuous aggregates, TimescaleDB had to finalize the value before it could be filtered against the predicate value, which caused queries to perform more slowly. With the new version of continuous aggregates, PostgreSQL can now search for the value directly, and we can add an index to meaningful columns to speed up the query even more!&lt;/p&gt;
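&lt;p&gt;For example, because aggregate values are now stored in their finalized form, a regular PostgreSQL index can be added to the continuous aggregate on the column being filtered (a sketch using the view from above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Index the finalized column used in the WHERE clause
CREATE INDEX ON five_min_candle_delta (change_pct);
&lt;/code&gt;&lt;/pre&gt;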

&lt;p&gt;In the case of the financial dataset, we see a very significant improvement: 1,336x faster. The large change in performance can be attributed to the formula that has to be calculated over all of the rows of data in the continuous aggregate. With the IoT dataset, we're comparing against a simple average function, but for the stock data, multiple values have to be finalized (FIRST/LAST) before the formula can be calculated and used for the filter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2z1qz6w3qz80rkhz7l9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2z1qz6w3qz80rkhz7l9.png" alt=" " width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Performance of a query with SELECT COUNT (*) plus WHERE in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Query #3: Select Top 10 Rows by Value
&lt;/h2&gt;

&lt;p&gt;Taking the first example a step further, it's very common to query data within a range of time and get the top rows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- IoT dataset
SELECT * FROM hourly_trip_stats
ORDER BY avg_fare desc
LIMIT 10;

-- Stock transactions dataset
SELECT * FROM five_min_candle_delta
ORDER BY change_pct DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this case, we tested queries with the continuous aggregate set to provide real-time results (the default for continuous aggregates) and materialized-only results. When set to real-time, TimescaleDB always queries data that's been materialized first and then appends (with a UNION) any newer data that exists in the raw data but that has not yet been materialized by the ongoing refresh policy. And, because it's now possible to index columns within the continuous aggregate, we added an index on the ORDER BY column.&lt;/p&gt;
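&lt;p&gt;For the materialized-only runs, real-time aggregation can be switched off per view (a sketch; the option name is per the TimescaleDB docs):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Query only materialized data; skip the UNION with
-- not-yet-materialized raw data
ALTER MATERIALIZED VIEW five_min_candle_delta
    SET (timescaledb.materialized_only = true);
&lt;/code&gt;&lt;/pre&gt;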

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bbvv3hctn85gfk6b8ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bbvv3hctn85gfk6b8ww.png" alt=" " width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Performance of a query with ORDER BY in a continuous aggregate TimescaleDB 2.6.1 and TimescaleDB 2.7&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Yes, you read that correctly. Nearly 45,000x better performance on ORDER BY&lt;/strong&gt; when the query only searches through materialized data.&lt;/p&gt;

&lt;p&gt;The dramatic difference between real-time and materialized-only queries is because of the UNION of both materialized and raw aggregate data. The PostgreSQL planner needs to union the total result before it can limit the query to 10 rows (in our example), so all of the data from both tables needs to be read and ordered first. When you only query materialized data, PostgreSQL and TimescaleDB know that they can query just the index of the materialized data.&lt;/p&gt;

&lt;p&gt;Again, storing the finalized form of your data and indexing column values dramatically impacts the querying performance of historical aggregate data! And all of this is updated continuously over time in a non-destructive way—something that's impossible to do with any other relational database, including vanilla PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query #4: Timescale Hyperfunctions to Re-aggregate Into Higher Time Buckets
&lt;/h2&gt;

&lt;p&gt;Another example we wanted to test was the impact finalizing data values has on our suite of analytical hyperfunctions. Many of the hyperfunctions we provide as part of the TimescaleDB Toolkit utilize custom aggregate values that allow many different values to be accessed later depending on the needs of an application or report. Furthermore, these aggregate values can be re-aggregated into different size time buckets. This means that if the aggregate functions fit your use case, one continuous aggregate can produce results for many different time_bucket sizes! This is a feature many users have asked for over time, and hyperfunctions make this possible.&lt;/p&gt;

&lt;p&gt;For this example, we only examined the New York City Taxicab dataset to benchmark the impact of finalized continuous aggregates. Currently, there is no aggregate hyperfunction that aligns with the OHLC values needed for the stock dataset; however, there is a feature request for it! (😉)&lt;/p&gt;

&lt;p&gt;Although there are not currently any one-to-one hyperfunctions that provide exact replacements for our min/max/avg example, we can still observe the query improvement using a tdigest value for each of the columns in our original query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original min/max/avg continuous aggregate for multiple columns:&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE MATERIALIZED VIEW hourly_trip_stats
WITH (timescaledb.continuous, timescaledb.finalized=false)
AS
SELECT
    time_bucket('1 hour',pickup_datetime) bucket,
    avg(fare_amount) avg_fare,
    min(fare_amount) min_fare,
    max(fare_amount) max_fare,
    avg(trip_distance) avg_distance,
    min(trip_distance) min_distance,
    max(trip_distance) max_distance,
    avg(congestion_surcharge) avg_surcharge,
    min(congestion_surcharge) min_surcharge,
    max(congestion_surcharge) max_surcharge,
    cab_type_id,
    passenger_count
FROM
    trips
GROUP BY
    bucket, cab_type_id, passenger_count;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Hyperfunction-based continuous aggregate for multiple columns:&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE MATERIALIZED VIEW hourly_trip_stats_toolkit
WITH (timescaledb.continuous, timescaledb.finalized=false)
AS
SELECT
    time_bucket('1 hour',pickup_datetime) bucket,
    tdigest(1,fare_amount) fare_digest,
    tdigest(1,trip_distance) distance_digest,
    tdigest(1,congestion_surcharge) surcharge_digest,
    cab_type_id,
    passenger_count
FROM
    trips
GROUP BY
    bucket, cab_type_id, passenger_count;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With the continuous aggregate created, we then queried this data in two different ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Using the same &lt;code&gt;time_bucket()&lt;/code&gt; size defined in the continuous aggregate, which in this example was one-hour data.&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT
    bucket AS b,
    cab_type_id,
    passenger_count,
    min_val(ROLLUP(fare_digest)),
    max_val(ROLLUP(fare_digest)),
    mean(ROLLUP(fare_digest))
FROM hourly_trip_stats_toolkit
WHERE bucket &amp;gt; '2021-05-01' AND bucket &amp;lt; '2021-06-01'
GROUP BY b, cab_type_id, passenger_count
ORDER BY b DESC, cab_type_id, passenger_count;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zdqafcqs358ygqtla4a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zdqafcqs358ygqtla4a.png" alt=" " width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Performance of a query with time_bucket() in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7 (the query uses the same bucket size as the definition of the continuous aggregate)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. We re-aggregated the data from one-hour buckets into one-day buckets.&lt;/strong&gt; This allows us to efficiently query different bucket lengths based on the original bucket size of the continuous aggregate.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT
    time_bucket('1 day', bucket) AS b,
    cab_type_id,
    passenger_count,
    min_val(ROLLUP(fare_digest)),
    max_val(ROLLUP(fare_digest)),
    mean(ROLLUP(fare_digest))
FROM hourly_trip_stats_toolkit
WHERE bucket &amp;gt; '2021-05-01' AND bucket &amp;lt; '2021-06-01'
GROUP BY b, cab_type_id, passenger_count
ORDER BY b DESC, cab_type_id, passenger_count;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj09v65k24hfwi0c0llr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj09v65k24hfwi0c0llr0.png" alt=" " width="800" height="209"&gt;&lt;/a&gt;      &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Performance of a query with time_bucket() in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7. The query re-aggregates the data from one-hour buckets into one-day buckets&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this case, the speed is almost identical because the same amount of data has to be queried. But if these aggregates satisfy your data requirements, only one continuous aggregate would be necessary in many cases, rather than a different continuous aggregate for each bucket size (one minute, five minutes, one hour, etc.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Query #5: Pivot Queries With FILTER
&lt;/h2&gt;

&lt;p&gt;In previous versions of continuous aggregates, many common SQL features were not permitted because of how the partial data was stored and finalized later. Using a PostgreSQL FILTER clause was one such restriction.&lt;/p&gt;

&lt;p&gt;For example, we took the IoT dataset and created a simple COUNT(*) to calculate each company's number of taxi rides (&lt;code&gt;cab_type_id&lt;/code&gt;) for each hour. Before TimescaleDB 2.7, you would have to store this data in a narrow column format, storing a row in the continuous aggregate for each cab type.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE MATERIALIZED VIEW hourly_ride_counts_by_type
WITH (timescaledb.continuous, timescaledb.finalized=false)
AS
SELECT
    time_bucket('1 hour',pickup_datetime) bucket,
    cab_type_id,
    COUNT(*)
FROM trips
WHERE cab_type_id IN (1,2)
GROUP BY
    bucket, cab_type_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To then query this data in a pivoted fashion, we could FILTER the continuous aggregate data after the fact.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT bucket,
    sum(count) FILTER (WHERE cab_type_id IN (1)) yellow_cab_count,
    sum(count) FILTER (WHERE cab_type_id IN (2)) green_cab_count
FROM hourly_ride_counts_by_type
WHERE bucket &amp;gt; '2021-05-01' AND bucket &amp;lt; '2021-06-01'
GROUP BY bucket
ORDER BY bucket;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In TimescaleDB 2.7, you can now store the aggregated data using a FILTER clause to achieve the same result in one step!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE MATERIALIZED VIEW hourly_ride_counts_by_type_new
WITH (timescaledb.continuous)
AS
SELECT
    time_bucket('1 hour',pickup_datetime) bucket,
    COUNT(*) FILTER (WHERE cab_type_id IN (1)) yellow_cab_count,
    COUNT(*) FILTER (WHERE cab_type_id IN (2)) green_cab_count
FROM trips
GROUP BY
    bucket;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Querying this data is much simpler, too, because the data is already pivoted and finalized.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT * FROM hourly_ride_counts_by_type_new
WHERE bucket &amp;gt; '2021-05-01' AND bucket &amp;lt; '2021-06-01'
ORDER BY bucket;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This saves storage (50% fewer rows in this case) and the CPU needed to finalize the COUNT(*) and then filter the results on cab_type_id for every query. We can see this in the query performance numbers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1a8zmqf30mqt1c954rmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1a8zmqf30mqt1c954rmo.png" alt=" " width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Performance of a query with FILTER in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Being able to use FILTER and other SQL features improves both developer experience and long-term flexibility!&lt;/p&gt;

&lt;h2&gt;
  
  
  Query #6: HAVING Stores Significantly Less Materialized Data
&lt;/h2&gt;

&lt;p&gt;As a final example of how the improvements to continuous aggregates will impact your day-to-day development and analytics processes, let's look at a simple query that uses a HAVING clause to reduce the number of rows that the aggregate stores.&lt;/p&gt;

&lt;p&gt;In previous versions of TimescaleDB, the HAVING clause couldn't be applied at materialization time. Instead, it was applied after the fact to all of the aggregated data as it was finalized. In many cases, this dramatically affected both the speed of queries to the continuous aggregate and the amount of data stored overall.&lt;/p&gt;

&lt;p&gt;Using our stock data as an example, let's create a continuous aggregate that only stores a row of data if the change_pct value is greater than 2%. This would indicate that a stock price changed dramatically over one hour, something we don't expect to see in most hourly stock trades.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE MATERIALIZED VIEW one_hour_outliers
WITH (timescaledb.continuous) AS
    SELECT
        time_bucket('1 hour', time) AS bucket,
        symbol,
        FIRST(price, time) AS "open",
        MAX(price) AS high,
        MIN(price) AS low,
        LAST(price, time) AS "close",
        MAX(day_volume) AS day_volume,
        (LAST(price, time)-FIRST(price, time))/LAST(price, time) AS change_pct
    FROM stocks_real_time srt
    GROUP BY bucket, symbol
    HAVING (LAST(price, time)-FIRST(price, time))/LAST(price, time) &amp;gt; .02;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once the dataset is created, we can query each aggregate to see how many rows matched our criteria.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT count(*) FROM one_hour_outliers;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45hoy2rmf5zp3xewpjmq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45hoy2rmf5zp3xewpjmq.png" alt=" " width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Table comparing the performance of a query with HAVING in a continuous aggregate in TimescaleDB 2.6.1 and TimescaleDB 2.7.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The biggest difference here (and the one that will more negatively impact the performance of your application over time) is the storage size of this aggregated data. Because TimescaleDB 2.7 only stores rows that meet the criteria, the data footprint is significantly smaller!&lt;/p&gt;
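&lt;p&gt;To check this on your own data, one approach (a sketch; column and function names per the TimescaleDB information views) is to look up the materialization hypertable behind the aggregate and measure it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Find the hypertable that stores the materialized rows...
SELECT materialization_hypertable_schema,
       materialization_hypertable_name
FROM timescaledb_information.continuous_aggregates
WHERE view_name = 'one_hour_outliers';

-- ...then measure its size (substitute the schema-qualified name found above)
SELECT pg_size_pretty(hypertable_size('_timescaledb_internal.&amp;lt;materialization_hypertable_name&amp;gt;'));
&lt;/code&gt;&lt;/pre&gt;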

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc6np5kkh8mzfugjrdqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc6np5kkh8mzfugjrdqw.png" alt=" " width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Storage footprint of a continuous aggregate bucketing stock transactions by the hour in TimescaleDB 2.6.1 and TimescaleDB 2.7&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage Savings in TimescaleDB 2.7
&lt;/h2&gt;

&lt;p&gt;One of the final pieces of this update that excites us is how much storage it will save over time. Users with large datasets and complex expressions in their continuous aggregates would regularly join our Slack community to ask why the rolled-up aggregate required more storage than the raw data itself.&lt;/p&gt;

&lt;p&gt;In every case we've tested, the new, finalized form of continuous aggregates is smaller than the same example in previous versions of TimescaleDB, with or without a HAVING clause that might filter additional data out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3zlhlo6exstawgnn3xl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3zlhlo6exstawgnn3xl.png" alt=" " width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Storage savings for different continuous aggregates in TimescaleDB 2.6.1 and TimescaleDB 2.7&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The New Continuous Aggregates Are a Game-Changer
&lt;/h2&gt;

&lt;p&gt;For those dealing with massive amounts of time-series data, continuous aggregates are the best way to solve a problem that has long haunted PostgreSQL users. Here is how continuous aggregates improve on plain materialized views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They always stay up-to-date, automatically tracking changes in the source table for targeted, efficient updates of materialized data.&lt;/li&gt;
&lt;li&gt;You can use configurable policies to conveniently manage refresh/update intervals.&lt;/li&gt;
&lt;li&gt;You can keep your materialized data even after the raw data is dropped, allowing you to downsample your large datasets.&lt;/li&gt;
&lt;li&gt;You can compress older data to save space and speed up analytical queries.&lt;/li&gt;
&lt;li&gt;And in TimescaleDB 2.7, continuous aggregates got much better.&lt;/li&gt;
&lt;/ul&gt;
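&lt;p&gt;To make the list above concrete, here is roughly what those policies look like in SQL, applied to the stock example (the interval values are illustrative, not recommendations):&lt;/p&gt;

&lt;p&gt;-- refresh the aggregate automatically every hour&lt;br&gt;
SELECT add_continuous_aggregate_policy('one_hour_outliers',&lt;br&gt;
       start_offset =&amp;gt; INTERVAL '3 days',&lt;br&gt;
       end_offset =&amp;gt; INTERVAL '1 hour',&lt;br&gt;
       schedule_interval =&amp;gt; INTERVAL '1 hour');&lt;br&gt;
&lt;br&gt;
-- compress aggregate chunks older than a week&lt;br&gt;
ALTER MATERIALIZED VIEW one_hour_outliers SET (timescaledb.compress = true);&lt;br&gt;
SELECT add_compression_policy('one_hour_outliers', compress_after =&amp;gt; INTERVAL '7 days');&lt;br&gt;
&lt;br&gt;
-- downsample: drop raw data after a month, keeping the aggregate&lt;br&gt;
SELECT add_retention_policy('stocks_real_time', drop_after =&amp;gt; INTERVAL '1 month');&lt;/p&gt;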

&lt;p&gt;First, they are blazing fast: as we demonstrated with our benchmark, the performance of continuous aggregates got consistently better across queries and datasets, up to thousands of times better for common queries. They also got lighter, requiring an average of 60% less storage.&lt;/p&gt;

&lt;p&gt;But beyond the performance improvements and storage savings, there are significantly fewer limitations on the aggregate queries you can use: continuous aggregates now support constructs that were previously off-limits, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregates with DISTINCT&lt;/li&gt;
&lt;li&gt;Aggregates with FILTER&lt;/li&gt;
&lt;li&gt;Aggregates with FILTER in HAVING clause&lt;/li&gt;
&lt;li&gt;Aggregates without a combine function&lt;/li&gt;
&lt;li&gt;Ordered-set aggregates&lt;/li&gt;
&lt;li&gt;Hypothetical-set aggregates&lt;/li&gt;
&lt;/ul&gt;
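&lt;p&gt;For instance, a continuous aggregate definition like the following, which mixes DISTINCT, FILTER, and an ordered-set aggregate, is accepted in TimescaleDB 2.7 (the view name &lt;code&gt;hourly_stats&lt;/code&gt; is hypothetical; the columns follow the stock example above):&lt;/p&gt;

&lt;p&gt;CREATE MATERIALIZED VIEW hourly_stats&lt;br&gt;
WITH (timescaledb.continuous) AS&lt;br&gt;
SELECT time_bucket('1 hour', time) AS bucket,&lt;br&gt;
       count(DISTINCT symbol) AS symbols_traded,&lt;br&gt;
       count(*) FILTER (WHERE price &amp;gt; 100) AS trades_over_100,&lt;br&gt;
       percentile_cont(0.5) WITHIN GROUP (ORDER BY price) AS median_price&lt;br&gt;
  FROM stocks_real_time&lt;br&gt;
 GROUP BY bucket;&lt;/p&gt;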

&lt;p&gt;This new version of continuous aggregates is available by default in TimescaleDB 2.7: now, when you create a new continuous aggregate, you will automatically benefit from all the latest changes. For your existing continuous aggregates, we recommend that you recreate them in the latest version to take advantage of all these improvements. Read our &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/update-timescaledb/" rel="noopener noreferrer"&gt;release notes&lt;/a&gt; for more information on TimescaleDB 2.7, and for instructions on how to upgrade, &lt;a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/update-timescaledb/" rel="noopener noreferrer"&gt;check out our docs.&lt;/a&gt;&lt;/p&gt;
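&lt;p&gt;Recreating an existing continuous aggregate boils down to create, backfill, swap. A rough sketch, using a hypothetical &lt;code&gt;_new&lt;/code&gt; suffix (aggregates created in 2.7 use the new finalized format by default):&lt;/p&gt;

&lt;p&gt;-- 1. create the new aggregate with the same query as the old one&lt;br&gt;
-- 2. backfill it over the full time range&lt;br&gt;
CALL refresh_continuous_aggregate('one_hour_outliers_new', NULL, NULL);&lt;br&gt;
-- 3. drop the old aggregate and rename the new one&lt;br&gt;
DROP MATERIALIZED VIEW one_hour_outliers;&lt;br&gt;
ALTER MATERIALIZED VIEW one_hour_outliers_new RENAME TO one_hour_outliers;&lt;/p&gt;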

</description>
      <category>postgres</category>
      <category>sql</category>
      <category>database</category>
    </item>
  </channel>
</rss>
