<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeff Tao</title>
    <description>The latest articles on DEV Community by Jeff Tao (@jhtao).</description>
    <link>https://dev.to/jhtao</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F831696%2F48286008-2a92-4ae0-b2e8-af27e2e1ea0f.jpg</url>
      <title>DEV Community: Jeff Tao</title>
      <link>https://dev.to/jhtao</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jhtao"/>
    <language>en</language>
    <item>
      <title>Is Closed-Source Software Really More Secure?</title>
      <dc:creator>Jeff Tao</dc:creator>
      <pubDate>Fri, 19 Apr 2024 03:29:47 +0000</pubDate>
      <link>https://dev.to/jhtao/is-closed-source-software-really-more-secure-2fdd</link>
      <guid>https://dev.to/jhtao/is-closed-source-software-really-more-secure-2fdd</guid>
      <description>&lt;p&gt;When I heard about the &lt;a href="https://arstechnica.com/security/2024/04/what-we-know-about-the-xz-utils-backdoor-that-almost-infected-the-world/" rel="noopener noreferrer"&gt;backdoor in xz Utils&lt;/a&gt; that was recently made public, I was struck not only by how serious the issue could have been, but also by how quickly it was resolved. A nefarious plot at least a year in the making had been uncovered almost as soon as it was released, with only a few distributions, none typically used in production environments at enterprises, even shipping the affected version. For me, being an open-source entrepreneur and developer, this episode only strengthens my belief that in our community, vulnerabilities are indeed shallow.&lt;/p&gt;

&lt;p&gt;But the same cannot be said for many of my customers. While open source is now widely accepted in the tech world, the traditional industries that my company TDengine serves are not all on board just yet. Although this backdoor does not appear to have caused harm to actual systems, it has at least temporarily affected the perceived security of open-source software, and some of the old talking points and misconceptions have already begun to resurface.&lt;/p&gt;

&lt;h2&gt;A Question of Trust&lt;/h2&gt;

&lt;p&gt;With the issue fresh in people’s minds, early last week I started receiving calls from users and fielding concerns from potential customers about the security of open-source software: “Since everyone can see the code, doesn’t that mean they can hack it whenever they like? Aren’t you just giving bad actors the tools they need to attack our systems and steal our data?”&lt;/p&gt;

&lt;p&gt;Those of us who have been in the open-source community for some time are certainly familiar with this narrative, which proprietary software vendors pushed hard in the 1990s and 2000s, before they realized that they could make money from open source. For customers in industries like manufacturing and energy whose enterprise software has always been closed source, the idea of security through obscurity can seem quite reasonable.&lt;/p&gt;

&lt;p&gt;Moreover, these industries are core components of our nation’s infrastructure and economy, and as such are huge targets for “hackers.” System stability and information security are of the utmost importance not only to enterprises, but also to ordinary people who use their services or buy their products. It’s only natural that customers in these industries are extremely security-sensitive and wary of deploying any software system that they don’t trust 100%.&lt;/p&gt;

&lt;h2&gt;Open Discussion, Open Resolution&lt;/h2&gt;

&lt;p&gt;What I remind my customers is that proprietary software is just as likely to have security vulnerabilities as open-source software — but you are less likely to hear about it. Do you really believe that your vendors will notify you every time they identify security issues in their code? In most cases, unless there is a legal requirement for disclosure, they keep their issues under wraps, and you may never know whether you were at risk. This presents an even greater danger, as you cannot mitigate threats when you are not aware of them.&lt;/p&gt;

&lt;p&gt;When software development is open, on the other hand, a multitude of eyeballs can potentially review every line of code. Open-source projects like TDengine do not hide behind the veil of proprietary licensing, but invite all developers to search for vulnerabilities. In this development model, customers can rest assured that security issues will be promptly and publicly identified and resolved, and that they can take appropriate action to mitigate or remediate threats before it’s too late.&lt;/p&gt;

&lt;p&gt;The truth is that closed-source software can still be exploited by bad actors — Microsoft is certainly &lt;a href="https://www.securityweek.com/microsofts-security-chickens-have-come-home-to-roost/" rel="noopener noreferrer"&gt;no stranger to vulnerabilities&lt;/a&gt;, for example, despite most of their products being proprietary. No one believes that locks are unpickable and safes are uncrackable because the manufacturers don’t publish schematics; why should software be any different? In fact, software vendors that don’t release their source code are not keeping the bad guys out – they’re only preventing the good guys from helping.&lt;/p&gt;

&lt;p&gt;For industrial enterprises, the myth that closed-source software is more secure needs to end. These companies often do not have the luxury of large IT departments and rely on vendors to ensure the security of their products. By moving to open source, they stand to gain at no cost an entire community of experts to keep their vendors honest and their systems secure.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>Developers: Stop Donating Your Work to Cloud Service Providers!</title>
      <dc:creator>Jeff Tao</dc:creator>
      <pubDate>Thu, 28 Mar 2024 06:11:58 +0000</pubDate>
      <link>https://dev.to/jhtao/developers-stop-donating-your-work-to-cloud-service-providers-476o</link>
      <guid>https://dev.to/jhtao/developers-stop-donating-your-work-to-cloud-service-providers-476o</guid>
      <description>&lt;p&gt;On March 20, Redis &lt;a href="https://redis.com/blog/redis-adopts-dual-source-available-licensing/" rel="noopener noreferrer"&gt;announced&lt;/a&gt; its transition from three-clause BSD to a dual-licensed structure with the Redis Source Available License and the Server Side Public License (SSPL), resulting immediately in much consternation among the open-source community, as evidenced on the &lt;a href="https://github.com/redis/redis/pull/13157" rel="noopener noreferrer"&gt;GitHub pull request&lt;/a&gt; making this change as well as on &lt;a href="https://news.ycombinator.com/item?id=39772562" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;, &lt;a href="https://www.reddit.com/r/programming/comments/1bjz7z7/redis_adopts_dual_sourceavailable_licensing/" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt;, and other platforms where the community congregates. By now it’s well-known that software released under the SSPL is not considered open source by the Open Source Initiative (OSI), because the license &lt;a href="https://opensource.org/blog/the-sspl-is-not-an-open-source-license" rel="noopener noreferrer"&gt;violates criterion 6&lt;/a&gt; of the Open Source Definition by discriminating against those making the software available as a service.&lt;/p&gt;

&lt;p&gt;This is the latest in a long line of open-source companies moving to a more restrictive license — &lt;a href="https://www.hashicorp.com/blog/hashicorp-adopts-business-source-license" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; was moved from the Mozilla Public License to the Business Source License last year, &lt;a href="https://www.elastic.co/blog/elastic-license-update" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt; switched from Apache 2.0 to the SSPL in 2021, and &lt;a href="https://grafana.com/blog/2021/04/20/grafana-loki-tempo-relicensing-to-agplv3/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; changed from Apache 2.0 to the GNU Affero General Public License (AGPL) later that year, just to list a few. Each of these instances caused at least some backlash in the community, with developers decrying them as restricting their freedoms and (except for the AGPL, which somehow achieved OSI approval) moving away from open source.&lt;/p&gt;

&lt;p&gt;Have all these companies used the open-source community simply as a way of promoting their products, with the intention from day one to pull the rug out from under the community and steal their work? Or have they sold their souls for sweet venture capital and been corrupted by MBAs lurking in their leadership? This is often the prevailing sentiment in online discussion.&lt;/p&gt;

&lt;h2&gt;The Newest Threat to Open Source&lt;/h2&gt;

&lt;p&gt;The truth is that in the war for software freedom, the battlefield has shifted. While many developers continue to focus on the traditional enemy of proprietary software, a greater foe has emerged in the form of cloud service providers. Instead of openly demonizing free software like the Microsofts of the 1990s, their attack on free software is much more insidious. &lt;strong&gt;They don’t want to kill open source; they want it to grow.&lt;/strong&gt; They sponsor all the conventions and initiatives and foundations, proudly proclaiming their support for free software, and even assigning developers to work full-time on open-source projects. They are happy to make these investments in open-source software because they know that &lt;strong&gt;they will be the only ones to profit from it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The developer community may still be on the fence about this, but for open-source companies like mine, the age of innocence is over. In their &lt;a href="https://www.timescale.com/blog/how-we-are-building-a-self-sustaining-open-source-business-in-the-cloud-era/" rel="noopener noreferrer"&gt;relicensing announcement&lt;/a&gt;, Timescale referred to the new crop of SSPL alternatives as “cloud protection licenses”, making the point quite clear. In the cloud-driven world of today, &lt;strong&gt;selling enterprise services for open-source software is no longer a sustainable business model&lt;/strong&gt; when your entire product can be sold as a service by massive corporations. Building a better product and stronger community is no longer a recipe for success when cloud service providers are free to package and sell your product, contribute little more than lip service, and encourage your community to push back against any action you take to benefit from the success of your own work. Independent developers may question the motives of for-profit companies, but my open-source time-series database project &lt;a href="https://tdengine.com" rel="noopener noreferrer"&gt;TDengine&lt;/a&gt; now pays the salaries of over 50 developers, all of whom contribute to our &lt;a href="https://github.com/taosdata/TDengine" rel="noopener noreferrer"&gt;codebase on GitHub&lt;/a&gt;. It would be impossible to support this development team financially if we had to compete against the cloud providers to sell our own product.&lt;/p&gt;

&lt;p&gt;Some developers claim that software freedom demands cloud service providers enjoy the same rights as individual developers, similar to those who insist that it is unfair for billionaires to pay higher taxes. They believe that restricting cloud service providers from exploiting your work goes against the spirit of open source as defined by the OSI, whose programs, according to their website, are “supported by Amazon, Google Open Source, and people like you.” (In what appears to be an oversight, Microsoft did not spend enough to be featured on the homepage.) And cloud service providers are delighted to have these principled absolutists on their side. &lt;strong&gt;Every line of MIT or BSD code that they write is a free contribution to the SaaS products of tomorrow.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are dedicated to working for massive corporations for free, don’t worry — cloud service providers are always quick to support the community in creating a fork. It happened with OpenSearch in 2021 and is happening with Redis forks as we speak — though it is surely just a coincidence that an AWS employee is driving the effort. Your ability to put more power in the hands of a few large companies cannot be hindered by license changes and will assuredly result in a better software landscape for the robber barons who exercise their freedom to use your code.&lt;/p&gt;

&lt;h2&gt;One Step Backward, Two Steps Forward&lt;/h2&gt;

&lt;p&gt;When I first turned my attention to free software in the early 2000s, the GNU General Public License (GPL) was the most popular licensing choice for open-source projects, and I distinctly remember the constant hand-wringing about licensing issues preventing the introduction of code into GPL repositories. For developers, moving toward permissive licenses was a natural reaction — we wanted to make our code available to everyone and we wanted everyone’s code to be available to us. We wanted to build and support a community based on mutual contribution and benefit, not get bogged down in the murky legal details and pointless ideological battles associated with copyleft.&lt;/p&gt;

&lt;p&gt;The consequences of that shift are only now becoming apparent. In our quest for freedom, we empowered those in a position to exploit us. For the sake of convenience, we put individual developers and small businesses at the mercy of large corporations and consolidated power in the oligarchy of the public cloud. Now it is time for all of us in the open-source community to realize that &lt;strong&gt;releasing projects under permissive licenses is in essence a donation of our time and effort to the cloud service providers&lt;/strong&gt; who will snap them up and sell them back to us.&lt;/p&gt;

&lt;p&gt;In 2019 when I open-sourced my distributed time-series database TDengine, I chose the AGPL specifically to stay true to open-source principles while preventing unilateral monetization by cloud service providers. The AGPL is not perfect, nor are attempts to improve on it like the SSPL, but I am confident that it is the best choice for the present and equally confident that the greater community, once the severity of the current situation is realized, will come together again and create even better licensing options for future open-source projects. For now, I call on open-source companies and developers alike to open their eyes and stop donating your work to cloud service providers. Protect your rights by using copyleft licensing for any serious projects. If someone gets paid for your work, it should be you.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>saas</category>
      <category>cloud</category>
      <category>licensing</category>
    </item>
    <item>
      <title>How to Choose the Best Time-Series Database</title>
      <dc:creator>Jeff Tao</dc:creator>
      <pubDate>Mon, 08 May 2023 06:36:13 +0000</pubDate>
      <link>https://dev.to/jhtao/how-to-choose-the-best-time-series-database-1hkb</link>
      <guid>https://dev.to/jhtao/how-to-choose-the-best-time-series-database-1hkb</guid>
      <description>&lt;p&gt;With the ever-increasing number of database management solutions on the market, how can you decide which time-series database (TSDB) is best for your use case? The following list shows the top 10 criteria for choosing the best time-series database:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Open source: You don't want to build your system on a black box, especially when there are many open-source products available. In addition to transparency, open-source products also have better ecosystems and developer communities and prevent vendor lock-in.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance: All time-series databases perform better than general databases when processing time-series data, but some have an issue with high cardinality, meaning that performance deteriorates as the number of unique time series (distinct tag combinations) in the database grows. Also, some time-series database management systems experience unacceptable latency when accessing historical data. When you select a time-series database, make sure that it performs well with a data set similar in size to what you'll have in production – not just now but in the future as well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalability: As your business grows, your data will too – that's why the best time-series database solutions need horizontal scalability. This is a weak spot for many current solutions, and even InfluxDB, the most popular time-series database, locks scalability away in its enterprise edition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Query language: SQL is still the most popular query language among database management systems: it's powerful, fast, and already known by millions of developers and administrators. However, some time-series databases, like InfluxDB, Prometheus, and OpenTSDB, use proprietary query languages instead of SQL. This makes these systems more difficult to learn, even for experienced users, and greatly increases the cost of migrating from a traditional database. Because TDengine and TimescaleDB retain SQL as the query language, they are much simpler options for deploying a new time-series database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ecosystem: Considering the number of devices and sensors that generate time-series data, the best time-series database solutions need to provide connectors in major programming languages in addition to REST APIs. Different methods for data ingestion as well as integration with a variety of visualization and BI tools are also essential.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloud native: It won't be long before most systems, including time-series databases, are running in the cloud. For that reason a cloud-native time-series database is the most future-ready choice, though you should ensure that your solution is really cloud-native, not just "cloud-ready."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extra features: Modern data platforms do more than just store data. You need a time-series database solution that supports features like continuous queries, caching, stream processing, and data subscription – otherwise, you'll have to integrate with specialized tools or implement them yourself, and that makes your system more complex and more expensive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Out-of-order data: In some time-series databases, like Prometheus, data points that are received out of order cannot be processed and are just thrown away. If out-of-order data may occur in your use case – for example, if your message queue is in the middle of your data path, or simply if you encounter network issues – you need to be sure that your database solution can handle that data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;System footprint: Depending on where and how your data is collected, such as on the edge, you might not be able to deploy a large-scale system and instead need a lightweight solution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring: The best time-series database solutions provide good observability as well as integration with monitoring tools like Grafana – otherwise, you won't be able to know whether issues have occurred until it's too late.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
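&lt;p&gt;Point 8 above can be illustrated with a minimal sketch (not any particular database's implementation) of accepting out-of-order points by keeping each series sorted on insert:&lt;/p&gt;

```python
import bisect

class SeriesBuffer:
    """Toy in-memory series that accepts out-of-order data points.
    A database that drops late points, by contrast, would reject any
    timestamp older than the latest one already seen."""
    def __init__(self):
        self.points = []  # list of (timestamp, value), kept sorted

    def insert(self, ts, value):
        bisect.insort(self.points, (ts, value))  # insert in timestamp order

buf = SeriesBuffer()
for ts, v in [(10, 1.0), (30, 3.0), (20, 2.0)]:  # the point at t=20 arrives late
    buf.insert(ts, v)
print([ts for ts, _ in buf.points])  # [10, 20, 30]
```

&lt;p&gt;A real engine handles this at the storage layer rather than in one flat list, but the contract toward the application is the same: late data is kept, not discarded.&lt;/p&gt;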

&lt;p&gt;By keeping these criteria in mind, you'll be able to select the best time-series database for your business needs. But to be even more sure, it's a good idea to test the database yourself.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://tdengine.com" rel="noopener noreferrer"&gt;TDengine&lt;/a&gt;, this is a simple process. TDengine can be installed in seconds on any major Linux distribution as well as macOS or Windows, and it includes the taosBenchmark tool that generates a sample data set for you. You can set up a test deployment and run your test queries at no cost and with minimum effort.&lt;/p&gt;

</description>
      <category>timeseries</category>
      <category>database</category>
      <category>opensource</category>
      <category>tdengine</category>
    </item>
    <item>
      <title>Why I’m Still Obsessed with the Time Series Database</title>
      <dc:creator>Jeff Tao</dc:creator>
      <pubDate>Wed, 26 Apr 2023 10:09:53 +0000</pubDate>
      <link>https://dev.to/jhtao/why-im-still-obsessed-with-the-time-series-database-684</link>
      <guid>https://dev.to/jhtao/why-im-still-obsessed-with-the-time-series-database-684</guid>
      <description>&lt;p&gt;The idea of a purpose-built time series database (TSDB) is not new. Looking back at the history of the field, RRDtool, which came out in 1999, was probably the first time series database. However, it wasn’t until 2015 that time series databases started gaining in popularity – but over the past few years, time series databases have become one of the fastest trending database management systems.&lt;/p&gt;

&lt;p&gt;At the end of 2016, I saw that a new era in information technology was beginning. Everything was becoming connected: from home appliances to industrial equipment, everything was becoming “smart” – and generating massive amounts of data. And not just any data, but time series data. I quickly realized that the efficient processing of time series data coming from these smart sensors and devices would become an essential element of technological development going forward, so I started to develop TDengine, a new time series database system, in June 2017.&lt;/p&gt;

&lt;p&gt;When I first started TDengine back in 2017, my main concern was whether the market still had room for another time series database. I constantly pondered whether existing databases were good enough to run time series applications, and whether they already met business needs. Even today, this is still something that I think about all the time, asking myself: is this time series database really worth my dedication and obsession?&lt;/p&gt;

&lt;p&gt;Now let me share with you my conclusions based on a technical analysis. I’ll go over the key elements of time series data processing one by one:&lt;/p&gt;

&lt;h2&gt;Scalability&lt;/h2&gt;

&lt;p&gt;Due to the expansion of IT infrastructure and the advent of the Internet of Things (IoT), the scale of data is growing rapidly. A modern data center may need to collect up to 100 million metrics – everything from network equipment and servers to virtual machines, containers, and microservices is constantly sending out time-series data. As another example, every smart meter in a distributed power grid generates at least one data point every minute, and there are over 102.9 million smart meters in the United States. It is impossible for a single machine to handle this much data, so any system designed to process time-series data must be scalable. &lt;/p&gt;
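&lt;p&gt;A quick back-of-the-envelope check of the smart-meter figures above shows why a single machine cannot keep up:&lt;/p&gt;

```python
meters = 102_900_000            # smart meters in the United States
readings_per_meter_per_min = 1  # at least one data point per minute
points_per_second = meters * readings_per_meter_per_min / 60
print(f"{points_per_second:,.0f} points/second")  # roughly 1.7 million
```

&lt;p&gt;And that is the floor: meters that report more than once a minute, plus the rest of the grid's sensors, push the sustained ingest rate well beyond it.&lt;/p&gt;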

&lt;p&gt;However, many of the market-leading time-series databases do not provide a scalable solution. Prometheus, the de facto standard time-series database for Kubernetes environments, does not have a distributed design, and it has to rely on Cortex, Thanos, or other third-party tools for scalability. InfluxDB offers clustering only to enterprise customers, not as open-source software.&lt;/p&gt;

&lt;p&gt;To get around this, many developers build their own scalable solutions by deploying a proxy server between their application and their TSDB servers (like InfluxDB or Prometheus). The collected time-series data is then divided among multiple TSDB servers based on the hash of the time-series ID.&lt;/p&gt;
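&lt;p&gt;A minimal sketch of that proxy's routing logic (the server names and choice of hash are illustrative, not taken from any product):&lt;/p&gt;

```python
import hashlib

TSDB_SERVERS = ["tsdb-0", "tsdb-1", "tsdb-2"]  # hypothetical backend nodes

def route(series_id: str) -> str:
    """Pick a backend from a stable hash of the time-series ID, so every
    point of a given series always lands on the same node."""
    digest = hashlib.md5(series_id.encode()).hexdigest()
    return TSDB_SERVERS[int(digest, 16) % len(TSDB_SERVERS)]

# Writes for one series stick to one node; different series spread out.
assert route("meter.1042.voltage") == route("meter.1042.voltage")
print(route("meter.1042.voltage"))
```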

&lt;p&gt;For data ingestion purposes, this does solve the issue of scalability. But at the same time, the proxy server has to merge the query results from each underlying node, posing a major technical challenge. For some queries, like standard deviation, you can’t just merge the results – you have to retrieve the raw data from each node. This means that you need to rewrite the entire query engine, which makes for a huge amount of development work. &lt;/p&gt;
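&lt;p&gt;A small numeric example of why some aggregates cannot be merged: combining per-node standard deviations naively gives a wildly wrong answer, so the proxy needs the raw data (or richer partial aggregates) to compute the true value. (Sketch with two hypothetical nodes.)&lt;/p&gt;

```python
import statistics

node_a = [1.0, 2.0, 3.0]        # data held by node A
node_b = [100.0, 101.0, 102.0]  # data held by node B

# "Merging" the per-node results by averaging them:
naive = (statistics.pstdev(node_a) + statistics.pstdev(node_b)) / 2

# The true standard deviation over all of the raw data:
actual = statistics.pstdev(node_a + node_b)

print(round(naive, 2))   # 0.82
print(round(actual, 2))  # 49.51 -- not recoverable from per-node stddevs alone
```

&lt;p&gt;(For standard deviation specifically, shipping each node's count, sum, and sum of squares would suffice; for other queries, such as exact percentiles, even that is not enough.)&lt;/p&gt;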

&lt;p&gt;A closer inspection of the design of InfluxDB and TimescaleDB shows that their scalability is actually quite limited. They store metadata in a central location, and each time series is always associated with a set of tags or labels. That means if you have one billion time series, the system needs to store one billion sets of tags.&lt;/p&gt;

&lt;p&gt;You can probably see what the issue is here: when you aggregate multiple time series, the system needs to determine which time series meet the tag filtering conditions first, and in a large dataset, this causes significant latency. This is known as the high-cardinality problem of time-series databases. And how can we solve this problem? The answer is a distributed design for metadata processing. Metadata cannot be stored in a central location, or it will quickly become a bottleneck. One simple solution is to use a distributed relational database for metadata, but this makes the system complicated, harder to maintain, and more expensive.&lt;/p&gt;

&lt;p&gt;TDengine 1.x was designed to have all metadata stored on the management node, so it suffers from high cardinality, too. We made some improvements in the time series database architecture of TDengine 2.x, storing tag values on each virtual node instead of the central management node, but the creation of new time series and the time taken to restart the system were still major bottlenecks. With the newly released TDengine 3.0, we were finally able to solve the high-cardinality problem completely.&lt;/p&gt;

&lt;h2&gt;Complexity&lt;/h2&gt;

&lt;p&gt;A database is a tool to store and analyze data. But time-series data processing requires more than just storage and analytics. In a typical time-series data processing platform, the TSDB is always integrated with stream processing, caching, data subscription, and other tools.&lt;/p&gt;

&lt;h3&gt;Stream Processing&lt;/h3&gt;

&lt;p&gt;Time-series data is a stream. To gain insight into operations faster or detect errors in less time, data points must be analyzed as soon as they arrive at the system. Thus stream processing is a natural fit for time-series data. Stream processing can be time-driven, producing new results at set intervals (known as continuous query), or data-driven, producing new results whenever a new data point arrives.&lt;/p&gt;

&lt;p&gt;InfluxDB, Prometheus, TimescaleDB, and TDengine all support continuous query. This is very useful for monitoring dashboards, as all of the charts and diagrams can be updated periodically. But not all data processing requirements – ETL, for example – can be met by continuous query alone, and time-series databases need to support event-driven stream processing.&lt;/p&gt;

&lt;p&gt;Before TDengine 3.0, no time-series database on the market had out-of-the-box support for event-driven stream processing. Instead, time-series data platforms are integrated with Spark, Flink, or other stream processing tools that are not designed for time-series data. These tools have difficulty processing the millions or even billions of streams in time-series datasets, and even if they are up to the task, it comes at the price of a huge amount of computing resources.  &lt;/p&gt;
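&lt;p&gt;The two modes described above can be contrasted with a toy sketch (illustrative only, not any particular engine's API):&lt;/p&gt;

```python
points = []  # shared buffer of (timestamp, value)

def on_arrival(ts, value, threshold=100.0):
    """Data-driven: react the moment a point arrives."""
    points.append((ts, value))
    if value > threshold:
        print(f"alert: {value} at t={ts}")

def continuous_query(window=3):
    """Time-driven (continuous query): recompute a fixed aggregate at
    set intervals over whatever has accumulated so far."""
    recent = points[-window:]
    if recent:
        avg = sum(v for _, v in recent) / len(recent)
        print(f"rolling average of last {len(recent)} points: {avg}")

for i, v in enumerate([10.0, 20.0, 150.0]):
    on_arrival(i, v)   # the 150.0 reading triggers an alert immediately
continuous_query()     # a scheduler would call this, e.g., every minute
```

&lt;p&gt;A dashboard is well served by the time-driven path alone; alerting and ETL need the data-driven path, which is what most time-series databases historically left to external engines.&lt;/p&gt;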

&lt;h3&gt;Caching&lt;/h3&gt;

&lt;p&gt;For many time-series data applications, like application performance monitoring, the values of data at specific times are not important. These applications are focused only on trends. However, IoT scenarios are a notable and important exception. For example, a fleet management system always wants to know the current position of each truck. For a smart factory, the system always needs to know the current state of every valve and the current reading of every meter.&lt;/p&gt;

&lt;p&gt;Most time-series databases, including InfluxDB, TimescaleDB, and Prometheus, cannot on their own guarantee that the latest data point of a time series is returned with minimal latency. To serve the current value of each time series without high latency, these data platforms are integrated with Redis. When new data points arrive at the system, they have to be written into Redis as well as the database. While this solution does work, it increases the complexity of the system and the cost of operations.&lt;/p&gt;

&lt;p&gt;TDengine, on the other hand, has supported caching from its first release. In many use cases, Redis can be completely removed from the system, making the overall data platform much simpler and less expensive to run.&lt;/p&gt;
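&lt;p&gt;The caching idea boils down to keeping the newest point per series in memory on the write path, so "current value" reads never scan historical data. A simplified sketch of the principle (not TDengine's actual implementation):&lt;/p&gt;

```python
latest = {}   # series_id -> (timestamp, value): the "current value" cache
storage = []  # stand-in for the durable time-series store

def write(series_id, ts, value):
    storage.append((series_id, ts, value))  # normal durable write path
    prev = latest.get(series_id)
    if prev is None or ts > prev[0]:        # ignore out-of-order arrivals
        latest[series_id] = (ts, value)

def current_value(series_id):
    return latest.get(series_id)  # O(1); no scan, no second system

write("truck-17.position", 1000, (39.90, 116.40))
write("truck-17.position", 1060, (39.95, 116.41))
print(current_value("truck-17.position"))  # the point at t=1060
```

&lt;p&gt;Because the cache is maintained inside the database's own write path, there is no dual-write to keep consistent with an external store like Redis.&lt;/p&gt;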

&lt;h3&gt;Data Subscription&lt;/h3&gt;

&lt;p&gt;The message queue plays an important role in many system architectures. Incoming data points are first written into a message queue and then consumed by other components in the system, including databases. The data in the message queue is usually kept for a specified period of time (seven days in Kafka, for example). This is the same as the retention policy in a time-series database.&lt;/p&gt;

&lt;p&gt;Most time-series databases ingest data very efficiently, up to millions of data points a second. This means that if time-series databases can provide data subscription functionality, they can replace message queues entirely, again simplifying system design and reducing costs.&lt;/p&gt;

&lt;p&gt;In a time-series database, incoming data points are stored in the write-ahead log (WAL) in append-only mode. This WAL file is normally removed once the data in memory is persisted to the database, and used only to recover data if the system crashes. However, if we don’t remove the WAL file automatically but keep it for a specified period, the WAL file can become a persistent message queue and be consumed by other applications.&lt;/p&gt;

&lt;p&gt;Providing data subscription via the WAL file has another big benefit: it enables filtering on data subscription. The system can save resources by passing only the data points that meet the filtering conditions to applications.&lt;/p&gt;
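&lt;p&gt;The WAL-based subscription described above amounts to an append-only log plus per-consumer offsets, with filtering applied on the server side. A simplified sketch of the idea (not TDengine's actual mechanism):&lt;/p&gt;

```python
wal = []  # append-only write-ahead log; retained instead of deleted

def ingest(series_id, ts, value):
    wal.append({"series": series_id, "ts": ts, "value": value})

def consume(offset, predicate):
    """Return entries past `offset` that match `predicate`, plus the new
    offset. Retaining the WAL for a set period instead of removing it is
    what turns it into a persistent message queue."""
    matched = [e for e in wal[offset:] if predicate(e)]
    return matched, len(wal)

ingest("meter-1", 0, 219.0)
ingest("meter-2", 0, 230.0)
ingest("meter-1", 60, 242.0)

# A subscriber that only wants meter-1 readings above 240:
events, offset = consume(0, lambda e: e["series"] == "meter-1" and e["value"] > 240)
print(events)  # only the 242.0 reading
print(offset)  # 3: resume point for the next poll
```

&lt;p&gt;Because the predicate runs where the data lives, only matching points cross the network to subscribers.&lt;/p&gt;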

&lt;p&gt;Of all the time-series databases available today, TDengine is the only one with data subscription. With TDengine 3.0, we are now able to provide performance comparable to Kafka as well as the same set of APIs.&lt;/p&gt;

&lt;h3&gt;Summary&lt;/h3&gt;

&lt;p&gt;While time-series databases still work without event-driven stream processing, caching, and data subscription, developers are forced to integrate their TSDBs with other tools to achieve the required functionality. This makes the system design overly complicated, requires more resources, and is harder to maintain. With these features built inside a time-series database, the overall system architecture is simplified and the cost of operation is reduced significantly.&lt;/p&gt;

&lt;h2&gt;Cloud Native&lt;/h2&gt;

&lt;p&gt;The most beautiful thing about cloud computing is its elasticity – storage and compute resources are essentially infinite, and you only pay for what you need. This is one of the main reasons that all applications, including time-series databases, are moving to the cloud.&lt;/p&gt;

&lt;p&gt;Unfortunately, most databases are just “cloud-ready”, not cloud-native. When you buy the cloud service provided by some database vendors, such as TimescaleDB, you need to tell the system how many virtual servers you want (including their CPU and memory configuration) and how many gigabytes of storage to allocate. Even if you don’t run any queries, you’re still forced to pay for the computing resources, and if your data grows in scale, you need to decide whether to buy more resources. By offering this kind of cloud solution, database service providers are really just reselling cloud platforms.&lt;/p&gt;

&lt;p&gt;To fully utilize the benefits provided by the cloud platform, time-series databases must be cloud-native. To achieve this, they need to be redesigned with the following three points in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Separation of compute and storage: In containerized environments, specific containers may be up or down at any time, but stored data is persistent. A traditional time-series database architecture cannot cope with this because it stores data locally.&lt;br&gt;
Furthermore, to run a complicated query or perform batch processing, more compute nodes need to be added dynamically to speed up the process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Elasticity: The system must be able to adjust storage and compute resources based on workload and latency requirements. For compute resources, scaling up or down is relatively straightforward. Storage resources, however, are another story: to scale a distributed database up or down, its shards have to be merged or split while live data is arriving and queries are running. Designing a system that can accomplish this is no easy task.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Observability: The status of a time-series database must be monitored together with the other components of the system, so a good database needs to provide full observability to make operations and management simpler.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, any newly developed time-series database must be cloud-native by design. While TDengine was designed from day one with a highly scalable distributed architecture, as of version 3.0 we now fully support the separation of compute and storage.&lt;/p&gt;

&lt;h2&gt;
  Ease of Use
&lt;/h2&gt;

&lt;p&gt;While ease of use is a subjective term, we can attempt to list here a few criteria for a more user-friendly time-series database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Query language: As SQL is still the most popular database query language, SQL support essentially eliminates the learning curve for developers. Proprietary query languages, on the other hand, require developers to spend valuable time learning them and also increase the cost of migrating to or from other databases. In TDengine, you can reuse existing SQL queries with little to no change, and some queries can even be simplified thanks to enhancements in TDengine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interactive console: An interactive console is the most convenient way for developers to manage running databases or run ad hoc queries. This remains true even with time-series databases deployed in the cloud.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sample code: Developers do not have time to read entire documentation sets just to learn how to use a certain feature or API. New database systems must provide sample code for major programming languages that developers can simply copy and paste into their applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data migration tools: A database management system needs to provide convenient and efficient tools to get data in and out of the database. The source and destination may be a file, another database, or a replica in a remote data center, and the transferred data may be a whole database, a set of tables, or data points in a time range.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
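&lt;p&gt;To make the first point concrete, here is the kind of standard SQL that carries over between systems (shown with Python's built-in SQLite purely as a stand-in for any SQL database; the &lt;code&gt;readings&lt;/code&gt; table is a made-up example, and TDengine's dialect layers time-series extensions on top of syntax like this):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, device TEXT, temp REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(1, "d1", 20.0), (2, "d1", 22.0), (3, "d2", 30.0)],
)

# Standard SQL: average temperature per device over a time range.
rows = conn.execute(
    "SELECT device, AVG(temp) FROM readings"
    " WHERE ts BETWEEN 1 AND 3 GROUP BY device ORDER BY device"
).fetchall()
# rows == [('d1', 21.0), ('d2', 30.0)]
```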

&lt;h2&gt;
  Time for a New Time Series Database
&lt;/h2&gt;

&lt;p&gt;Considering the above four major factors, I started to develop a new time-series database from scratch in 2017. After many iterations, the TDengine team and I were proud to release TDengine 3.0 on August 23, 2022.&lt;/p&gt;

&lt;p&gt;TDengine 3.0 is a comprehensive, purpose-built platform designed from day one to meet the needs of next-generation time series applications. It features a true cloud-native time series database design in which both compute and storage resources are separated and can be changed dynamically based on the workload. It can be deployed on public, private or hybrid clouds. Its scalability is unprecedented, and it can support one billion time series while still outperforming other time series databases in terms of data ingestion rate and query latency.&lt;/p&gt;

&lt;p&gt;Additionally, TDengine 3.0 simplifies system architecture and reduces operating costs with its built-in caching, stream processing (event- and time-driven), and data subscription features. It is no longer necessary to integrate a multitude of third-party components just to have the functionality you need for time-series data processing.&lt;/p&gt;

&lt;p&gt;Most importantly, TDengine 3.0 helps you gain insight into your time-series data with analytical capabilities comparable to relational databases and support for standard SQL syntax with time-series-specific extensions.&lt;/p&gt;

&lt;p&gt;Since its launch in 2017, TDengine has evolved from 1.0 to 2.0 to 3.0 and earned wide recognition from enterprise customers and community users alike. It has gathered over 20,000 stars on GitHub since going open source in July 2019 and has quickly risen in the time-series database rankings on DB-Engines.&lt;/p&gt;

&lt;p&gt;TDengine 3.0 isn’t the end of the story for time series databases, but we believe that it solves all the major issues in the field today. If you’re interested in trying it yourself for free, you can &lt;a href="https://docs.tdengine.com/releases/tdengine" rel="noopener noreferrer"&gt;download the installation package&lt;/a&gt; or check out the source code on our &lt;a href="https://github.com/taosdata/TDengine" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. As a developer, I would appreciate any comments, feedback, or even contributions that you may have to improve the product further.&lt;/p&gt;

</description>
      <category>timeseries</category>
      <category>database</category>
      <category>opensource</category>
      <category>tdengine</category>
    </item>
    <item>
      <title>What Is a Time-Series Database (TSDB) and Why Do I Need One?</title>
      <dc:creator>Jeff Tao</dc:creator>
      <pubDate>Fri, 21 Apr 2023 06:56:04 +0000</pubDate>
      <link>https://dev.to/jhtao/what-is-a-time-series-database-tsdb-and-why-do-i-need-one-mkh</link>
      <guid>https://dev.to/jhtao/what-is-a-time-series-database-tsdb-and-why-do-i-need-one-mkh</guid>
      <description>&lt;p&gt;A time-series database (TSDB) is a database management system that is optimized to store, process, and analyze time-series data.&lt;/p&gt;

&lt;p&gt;Time-series data is a sequence of data points representing changes in a measurement, or a series of events, over a period of time. Each data point is always timestamped, and the sequence of data points is indexed or ordered by timestamp. The data generated by sensors on industrial equipment, smart devices, IT monitoring systems, and stock market trades are all examples of time-series data.&lt;/p&gt;

&lt;p&gt;It is possible to process time-series data with relational or NoSQL databases, but purpose-built time-series databases are optimized to handle the special characteristics of time-series data. This means that time-series databases are much more efficient in terms of ingestion rate, query latency, and data compression. In addition, time-series databases include special analytic functions and data management features so that you can develop applications more easily.&lt;/p&gt;

&lt;h2&gt;
  What Are the Characteristics of Time-Series Data?
&lt;/h2&gt;

&lt;p&gt;The following characteristics are inherent to time-series data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Timestamp: Every data point has a timestamp. The timestamp is the key for computing or analysis.&lt;/li&gt;
&lt;li&gt;Structure: Unlike data from Internet applications, the metrics generated by devices or from monitoring are always structured. They have predefined data types or fixed lengths, and the structure will not change unless the device firmware is updated.&lt;/li&gt;
&lt;li&gt;Stream-like: Data sources generate data at a constant rate, like an audio or video stream, and these data streams are independent of each other.&lt;/li&gt;
&lt;li&gt;Stable flow: Unlike e-commerce or social media applications, time-series data traffic is stable over time and can be calculated and predicted given the number of data sources and the sampling period.&lt;/li&gt;
&lt;li&gt;Immutability: A time-series data source generates each data point once, never correcting or updating existing data. Time-series data is generally append-only, similar to log data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  How Is Time-Series Data Used?
&lt;/h2&gt;

&lt;p&gt;Time-series data is often used to look for insights into operations, raise alerts based on real-time analysis, and forecast future trends. The following characteristics are found in time-series data applications:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;High write-read ratio: Time-series data is written far more often than it is read. A single post on an Internet application like Twitter or LinkedIn may be read by millions of users, whereas raw time-series data points are scanned and analyzed mainly by applications and algorithms.&lt;/li&gt;
&lt;li&gt;Retention policy: In general, time-series data is not stored forever. Organizations have a retention policy that defines the data lifecycle, and the data is deleted once its lifecycle is over.&lt;/li&gt;
&lt;li&gt;Real-time analytics and computing: To detect abnormal behavior and raise alerts based on the collected data or aggregation results, time-series data must be computed in real time.&lt;/li&gt;
&lt;li&gt;Query scope: Time-series data is always queried over a period of time or a set of data sources, and filters are used such that not all historical data is queried. In addition, data aggregation is always applied on all or a subset of the data sources with a filter condition.&lt;/li&gt;
&lt;li&gt;Trends: In time-series data, single data points are usually not important. Instead, the focus is on how data trends over a period of time, such as changes in the past hour or day.&lt;/li&gt;
&lt;/ol&gt;
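&lt;p&gt;The last point can be made concrete with a minimal sketch: rather than inspecting single data points, an application typically computes something like a moving average over a window (plain Python here, with made-up sample values):&lt;/p&gt;

```python
from collections import deque

def moving_average(stream, window):
    """Yield the average of the last `window` values for each new point."""
    buf = deque(maxlen=window)  # old values fall off automatically
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

readings = [10, 12, 14, 40, 12]
print(list(moving_average(readings, 3)))
# [10.0, 11.0, 12.0, 22.0, 22.0] -- the spike at 40 shows up as a trend shift
```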

&lt;p&gt;Time-series solutions like TDengine optimize their design based on these characteristics, which enables more efficient processing of time-series data and better performance compared with general databases.&lt;/p&gt;

&lt;h2&gt;
  Why Does Time-Series Data Require Specialized Databases?
&lt;/h2&gt;

&lt;p&gt;Today, everything is online – meters, cars, elevators, assembly lines, and even bicycles are connected to the Internet. And all of these items are emitting a relentless stream of metrics and events. With the advent of IoT and the cloud, the volume of time-series data has begun growing exponentially in an unprecedented way. The massive size of time-series data sets is a major challenge for general database management systems like relational and NoSQL databases. In particular, the following aspects of time-series data are difficult for non-specialized databases to handle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data ingestion rate: In many time-series data scenarios, millions of data points are produced every second and need to be ingested in real time. Relational databases are not designed to handle this amount of data, and while NoSQL databases can be scaled to handle it, the amount of resources required quickly becomes prohibitive.&lt;/li&gt;
&lt;li&gt;Query latency: Time-series applications often need to scan a huge number of data points to get an aggregation result, which can result in high latency. For example, it would take hours for a general database to calculate the average response time of all clicks on Amazon.com, by which time the aggregation result could be outdated.&lt;/li&gt;
&lt;li&gt;Storage cost: Internet-connected devices and applications are generating data nonstop 24/7 – sometimes exceeding a terabyte in a single day. Because relational and NoSQL databases cannot compress this data efficiently, storage costs can become high very fast.&lt;/li&gt;
&lt;/ol&gt;
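&lt;p&gt;The storage-cost point deserves one concrete illustration. Values and timestamps in time-series data change slowly between samples, so delta encoding (the core idea behind compression schemes such as Facebook's Gorilla, shown here in heavily simplified form) turns large numbers into small, highly compressible ones:&lt;/p&gt;

```python
def delta_encode(values):
    """Store the first value, then only the difference from the previous one."""
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    """Reverse delta_encode by accumulating the differences."""
    values = [deltas[0]]
    for d in deltas[1:]:
        values.append(values[-1] + d)
    return values

# Timestamps sampled roughly every second: big numbers become tiny deltas.
timestamps = [1700000000, 1700000001, 1700000002, 1700000004]
encoded = delta_encode(timestamps)
# encoded == [1700000000, 1, 1, 2]
assert delta_decode(encoded) == timestamps  # lossless round trip
```

&lt;p&gt;A general-purpose row store cannot assume this regularity, which is why purpose-built TSDBs achieve much better compression ratios on the same data.&lt;/p&gt;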

&lt;p&gt;These issues mainly involve efficiency in processing large data sets, but there are also areas where general databases often do not support even the basic requirements of time-series applications:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data lifecycle management: Once time-series data ages out, it is generally removed in batches, not one data point at a time.&lt;/li&gt;
&lt;li&gt;Roll-up: Time-series data is rolled up based on a specified time window and saved into a new table. In addition, raw data and rolled-up data may have different lifecycles and retention policies.&lt;/li&gt;
&lt;li&gt;Special analytic functions: Besides the functions provided by general databases, time-series applications need functions like time-weighted average, moving average, cumulative sum, rate of changes, elapsed time for a specific state, and delta between two consecutive data points.&lt;/li&gt;
&lt;li&gt;Interpolation: The database management system must be able to interpolate data based on the adjacent data points and rules in order to regularize data sets when required by applications or algorithms.&lt;/li&gt;
&lt;li&gt;Continuous query: Time-series applications run queries in the background periodically over a sliding time window in order to populate dashboards, generate reports, and downsample data sets.&lt;/li&gt;
&lt;li&gt;Session and state windows: Aggregation and analytic functions may be run over a session or state window, not just time – for example, consider a function that calculates average power consumption only when a machine is in the running state.&lt;/li&gt;
&lt;/ol&gt;
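&lt;p&gt;Two of these requirements, the time-weighted average and interpolation, are worth spelling out, since they are what developers end up hand-coding on a general database (a simplified Python sketch using trapezoidal weighting; the function names and sample data are made up for illustration):&lt;/p&gt;

```python
def time_weighted_average(points):
    """points: (timestamp, value) pairs sorted by timestamp.
    Each segment is weighted by its duration (trapezoidal rule)."""
    area = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        area += (v0 + v1) / 2 * (t1 - t0)
    return area / (points[-1][0] - points[0][0])

def interpolate(points, t):
    """Linearly interpolate the value at time t from the adjacent points."""
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t >= t0 and t1 >= t:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the sampled range")

data = [(0, 10.0), (10, 10.0), (20, 30.0)]
print(time_weighted_average(data))  # 15.0, not the naive mean of 16.67
print(interpolate(data, 15))        # 20.0, halfway between 10.0 and 30.0
```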

&lt;p&gt;With general databases, developers are forced to write custom code to implement these features. Different data workloads require different database solutions – one size does not fit all. For time-series data, no matter the size of your data set, a purpose-built time-series database is the best tool for the job.&lt;/p&gt;

&lt;h2&gt;
  Why Are Time-Series Databases Becoming Popular?
&lt;/h2&gt;

&lt;p&gt;Time-series databases are not new: they have been widely used in the financial and process industries for decades.&lt;/p&gt;

&lt;p&gt;However, they are becoming popular now mainly due to the rapid growth of the IoT. As more and more devices are Internet-connected and constantly sending data – time-series, of course – to the cloud, an increasing number of sectors are becoming interested in purpose-built time-series databases. As production modernizes and control systems evolve into the Industrial Internet of Things (IIoT), the industrial applications of time-series data are becoming evident as well. Finally, IT infrastructure has been steadily expanding, and everything from servers, containers, and network devices to apps and microservices is being monitored, which also generates massive amounts of time-series data.&lt;/p&gt;

&lt;p&gt;Technologically speaking, older time-series databases are often closed systems built on outdated architectures, and they cannot scale to support the growing volume of data. In the old days, a million time-series data points seemed like a huge number, but now millions and even billions of data points are nothing out of the ordinary. Furthermore, integrating legacy time-series solutions with modern data analysis tools, such as artificial intelligence and machine learning frameworks, is difficult if not impossible. These legacy systems cannot be moved to the cloud without significant effort, and their licensing models are no longer acceptable for modern applications.&lt;/p&gt;

&lt;p&gt;The growing market and the limitations of older time-series databases leave space for a new generation of time-series databases. Over the past 10 years, at least 20 new time-series databases have been released on the market, with open-source time-series databases becoming particularly popular.&lt;/p&gt;

&lt;p&gt;Learn more at &lt;a href="https://tdengine.com" rel="noopener noreferrer"&gt;tdengine.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>timeseries</category>
      <category>database</category>
      <category>opensource</category>
      <category>tdengine</category>
    </item>
  </channel>
</rss>
