DEV Community

NARESH


What Is Elasticsearch TSDS And Why We Migrated From Standard Indices


TL;DR
Elasticsearch works extremely well for search, analytics, and observability workloads, but standard indices slowly become inefficient once telemetry data starts growing at large scale.

This blog explains why time-series workloads behave differently from normal application data, how Elasticsearch internally stores data using Lucene segments, and why Time Series Data Streams (TSDS) were introduced to optimize storage, routing, lifecycle management, and long-term retention for telemetry systems.

The blog also explores how TSDS internally organizes data using timestamps, backing indices, and dimensions, along with an important operational lesson:

If you are planning to move into TSDS, do it as early as possible before historical data grows into a large-scale migration problem.

This is not a setup tutorial. It is a systems-design-oriented deep dive into how Elasticsearch handles time-series data internally and why TSDS becomes important at scale.


Most engineers know Elasticsearch as a search engine or a logging platform. But once systems start generating telemetry and metrics data at large scale, Elasticsearch slowly becomes a storage architecture problem rather than just a search problem.

Assume a large-scale telemetry platform ingesting nearly 900GB to 1TB of metrics data every single day. At that scale, the challenge is no longer just about indexing documents or rendering dashboards. The real problem becomes storage growth, segment merge pressure, retention management, query efficiency, and infrastructure cost.

Within a few months, clusters can easily accumulate tens of terabytes of historical metrics data. Storing that much data using standard Elasticsearch indices becomes increasingly expensive, both operationally and financially. The problem is not just storing data, but storing it efficiently enough for long-term scalability.
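To make that growth concrete, here is a rough back-of-the-envelope sketch in Python. The daily volume comes from the scenario above; the 90-day retention window and the replica count are illustrative assumptions, and compression is ignored entirely.

```python
def retained_storage_tb(daily_ingest_tb: float, retention_days: int, replicas: int = 0) -> float:
    """Approximate on-disk footprint: the primary copy plus each replica copy."""
    copies = 1 + replicas
    return daily_ingest_tb * retention_days * copies


# Assumed values: 0.9 TB/day retained for 90 days.
primaries_only = retained_storage_tb(0.9, 90)
with_one_replica = retained_storage_tb(0.9, 90, replicas=1)

print(f"primaries only:   {primaries_only:.0f} TB")
print(f"with one replica: {with_one_replica:.0f} TB")
```

Even without replicas, three months of raw data already sits in the tens of terabytes, and a single replica doubles it.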

This is where Elasticsearch Time Series Data Streams (TSDS) enters the picture.

But this blog is not another setup tutorial or migration guide. Instead, the goal here is to understand why TSDS exists, what architectural problem it solves, and how Elasticsearch internally handles time-series workloads.

More importantly, this blog approaches Elasticsearch from a systems-design perspective. Elasticsearch is not designed to be a general-purpose database, and understanding its storage model, segment architecture, routing behavior, and lifecycle management is critical before introducing TSDS into large-scale systems.

This blog focuses entirely on building that understanding. In the upcoming blogs, I'll go deeper into downsampling, historical reindexing, rollover behavior, and the operational challenges involved in large-scale TSDS migrations.


Why Elasticsearch Is Not Usually Used As A Standalone General-Purpose Database

One of the biggest misconceptions around Elasticsearch is that it can completely replace every other database in a system. Technically, Elasticsearch is capable of handling many general-purpose workloads, and several companies do use it beyond just search or observability use cases. Modern versions of Elasticsearch also provide replication, durable writes through the transaction log, and atomic operations on individual documents, although they do not offer multi-document ACID transactions.

But in real-world system design, Elasticsearch is usually not chosen as the primary database for highly transactional applications.

This is because Elasticsearch is architecturally optimized for a different class of workloads compared to databases like PostgreSQL or MySQL. Traditional relational databases are specifically designed around transactional consistency, relational queries, normalized data models, and frequent updates. Elasticsearch, on the other hand, is optimized for distributed search, aggregations, analytics, and high-volume ingestion workloads.

Internally, Elasticsearch is built on top of Lucene, which uses immutable segment-based storage. Instead of continuously modifying rows in place, Elasticsearch writes new segments and merges them over time. This architecture works extremely well for:

  • full-text search
  • observability platforms
  • logging systems
  • telemetry pipelines
  • analytics workloads
  • append-heavy ingestion systems

This is one of the main reasons Elasticsearch became extremely popular in monitoring and metrics platforms. Systems generating hundreds of gigabytes or even terabytes of telemetry data daily benefit heavily from Elasticsearch's distributed indexing and aggregation capabilities.

However, every architecture comes with tradeoffs.

Large-scale ingestion introduces segment merge pressure, storage overhead, and lifecycle management challenges. And once time-series workloads start growing rapidly, storing telemetry data using standard indices becomes increasingly inefficient both operationally and financially.


Why Time-Series Data Is Different

Before understanding TSDS, it is important to understand why time-series workloads behave very differently from normal application data.

Most traditional application databases deal with records that constantly change over time. Users update profiles, order statuses change, inventory values get modified, and transactions continuously alter existing rows. These systems are designed around mutable data.

Time-series data behaves almost the opposite way.

Telemetry metrics, infrastructure monitoring data, observability events, sensor readings, and operational statistics are usually written once and rarely modified again. The data keeps arriving continuously, always attached to a timestamp, and over time the volume becomes enormous.

More importantly, these systems are not usually queried for individual documents. Nobody realistically searches for one specific CPU metric generated at an exact second. Instead, the value comes from understanding patterns over time. Engineers care more about trends, spikes, averages, latency distribution, anomaly detection, and infrastructure behavior across larger time windows.

That changes how the storage engine should think about the data internally.

At that point, the challenge is no longer simply storing JSON documents. The real challenge becomes how efficiently the system can organize, compress, aggregate, and retain massive streams of timestamp-oriented data without continuously increasing storage and operational cost.

This is where standard indices slowly start becoming inefficient.

A normal index treats telemetry documents almost like generic application documents, even though time-series data is far more predictable in nature. It arrives sequentially, follows strict temporal patterns, and is usually queried inside bounded time windows. Once the storage engine understands those patterns, it can optimize much more aggressively around storage layout, routing, compression, and lifecycle management.

That idea is the foundation behind Elasticsearch TSDS.

But before understanding how TSDS solves this problem, we first need to understand how Elasticsearch actually stores data internally through Lucene segments.


How Elasticsearch Actually Stores Data Internally

To understand why TSDS exists, we first need to understand one of the most important concepts inside Elasticsearch: Lucene segments.


Most engineers interact with Elasticsearch through indices, documents, shards, and queries. But internally, Elasticsearch does not continuously modify documents the way traditional databases modify rows. Instead, Elasticsearch stores data inside immutable Lucene segments.

You can think of a segment like a sealed storage box containing a collection of indexed documents. Once that box is sealed, the data inside it is never modified directly again.

When new documents arrive, Elasticsearch does not reopen old segments and insert data into them. Instead, it creates new segments. As more data keeps getting indexed, more and more segments start accumulating inside the shard.

Over time, Elasticsearch performs segment merges in the background. Smaller segments get combined into larger segments to reduce fragmentation and improve query efficiency. This process is one of the most important internal behaviors of Elasticsearch because querying hundreds of tiny segments is significantly more expensive than querying a smaller number of larger optimized segments.
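The flush-and-merge cycle described above can be sketched with a toy model. This is not Lucene's actual merge policy; the class name ToyShard, the merge_factor parameter, and the rule "merge when the newest N segments are the same size" are simplifying assumptions chosen only to show why merging costs extra writes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyShard:
    """Toy model of a shard: sealed segments are never modified, only merged."""
    merge_factor: int = 10
    segments: list = field(default_factory=list)  # doc count of each sealed segment
    docs_rewritten: int = 0                       # docs copied again by merges

    def flush(self, doc_count: int) -> None:
        # New documents always land in a brand-new segment.
        self.segments.append(doc_count)
        self._maybe_merge()

    def _maybe_merge(self) -> None:
        mf = self.merge_factor
        # Hypothetical rule: combine the newest `mf` segments when they are equal in size.
        while len(self.segments) >= mf and len(set(self.segments[-mf:])) == 1:
            merged = sum(self.segments[-mf:])
            del self.segments[-mf:]
            self.segments.append(merged)
            self.docs_rewritten += merged


shard = ToyShard(merge_factor=10)
for _ in range(105):
    shard.flush(1)

print(shard.segments)        # few large segments plus a tail of small ones
print(shard.docs_rewritten)  # work spent purely on merging
```

In this model 105 documents end up causing 200 document rewrites, which is the kind of background cost that grows with continuous ingestion.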

At small scale, this architecture works extremely well.

But once telemetry systems start generating massive continuous streams of time-series data, the behavior changes dramatically.

Imagine a platform continuously ingesting metrics every few seconds from thousands of devices, interfaces, or services. Elasticsearch keeps creating new segments continuously. Background merges become heavier. Disk I/O increases. CPU usage rises. Query fanout grows larger. And eventually, a significant portion of cluster resources starts getting consumed just managing segments internally.

This is one of the reasons why large-scale observability platforms become operationally expensive over time.

The important thing to understand here is that Elasticsearch is not inefficient. In fact, Lucene's segment architecture is one of the reasons Elasticsearch became extremely powerful for distributed search and analytics workloads. The real issue is that time-series data follows highly predictable patterns, while standard indices still treat those documents mostly as generic data.

That mismatch becomes increasingly expensive at scale.

This is exactly where TSDS changes the model. Instead of treating telemetry data like generic JSON documents, Elasticsearch starts organizing the data based on time-oriented behavior, routing patterns, and lifecycle awareness.

And once the storage engine understands that pattern, optimization becomes much more aggressive and much more efficient.


Why Standard Indices Become Inefficient For Time-Series Workloads

The important thing about time-series systems is that the value of the data changes over time, but standard indices do not naturally understand that behavior.

For example, raw telemetry collected every few seconds is extremely valuable for recent monitoring and debugging. But after a few weeks or months, most systems no longer need second-level granularity for historical analysis. At that stage, teams usually care more about trends, averages, spikes, and long-term behavioral patterns rather than every individual metric document.

The problem is that standard indices continue storing all historical data at the same granularity and storage cost, regardless of how the data is actually being used.
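The idea of trading granularity for storage can be illustrated with a small Python sketch. It is only a conceptual model of downsampling, not Elasticsearch's downsampling feature; the function name and the (epoch_seconds, value) input shape are assumptions made for the example.

```python
from collections import defaultdict


def downsample(points, interval_seconds):
    """Collapse (epoch_seconds, value) samples into one summary per time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % interval_seconds)  # floor to the bucket boundary
        buckets[bucket_start].append(value)
    return {
        start: {"min": min(vals), "max": max(vals), "sum": sum(vals), "count": len(vals)}
        for start, vals in sorted(buckets.items())
    }


# Five raw samples spread over two hours (made-up values).
raw = [(0, 10.0), (10, 20.0), (20, 30.0), (3600, 50.0), (3610, 70.0)]
hourly = downsample(raw, 3600)
print(hourly)
```

Five raw documents become two summaries, and averages, minimums, and maximums over long windows can still be computed from them.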

As ingestion volume grows, this creates a very expensive long-term storage model. Large-scale telemetry platforms can easily accumulate tens of terabytes of historical metrics data within a short period of time. Retaining all of that data in raw format increases storage cost, shard count, operational overhead, and query complexity together.

Another important issue is that historical queries usually become aggregation-heavy. Most dashboards and monitoring systems query data across bounded time ranges such as:

  • last 15 minutes
  • last 24 hours
  • last 30 days
  • last 6 months

But standard indices are not specifically optimized around time-aware storage behavior. They store telemetry documents similarly to generic application documents, even though time-series workloads follow highly predictable patterns.
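A typical dashboard request of this kind might look like the following sketch, which builds a search body as a plain Python dictionary. The range query, date_histogram, and avg aggregation are standard Elasticsearch query DSL; the function name and the metric field name "value" are hypothetical.

```python
import json


def bounded_trend_query(lookback: str, bucket: str, metric_field: str) -> dict:
    """Build a search body: average of one metric per time bucket inside a bounded window."""
    return {
        "size": 0,  # only aggregation results are needed, not individual documents
        "query": {"range": {"@timestamp": {"gte": f"now-{lookback}", "lte": "now"}}},
        "aggs": {
            "over_time": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": bucket},
                "aggs": {"avg_value": {"avg": {"field": metric_field}}},
            }
        },
    }


# Last 24 hours in 5-minute buckets; "value" is an assumed field name.
body = bounded_trend_query("24h", "5m", "value")
print(json.dumps(body, indent=2))
```

Note that the query never asks for a single document. Everything is a time filter plus an aggregation, which is the access pattern a time-aware storage layout can exploit.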

This is where the inefficiency starts becoming architectural instead of operational.

At smaller scale, these limitations are usually manageable. But once ingestion reaches hundreds of gigabytes to nearly a terabyte per day, long-term retention and storage efficiency become critical design problems rather than simple infrastructure concerns.

This is exactly why Elasticsearch introduced Time Series Data Streams (TSDS).

Instead of treating telemetry data like generic JSON documents, TSDS allows Elasticsearch to organize the storage model around timestamp-oriented behavior, lifecycle awareness, routing efficiency, and long-term retention optimization.


What Is Elasticsearch TSDS

Time Series Data Streams (TSDS) is Elasticsearch's specialized architecture for handling time-series workloads such as telemetry metrics, infrastructure monitoring, observability events, and operational statistics.

The important thing to understand is that TSDS is not simply a renamed index or a lightweight feature added on top of Elasticsearch. It fundamentally changes how Elasticsearch internally organizes and manages time-oriented data.

In a standard index, Elasticsearch stores incoming documents mostly as generic records without deeply understanding the structure of the workload itself. But time-series data follows highly predictable patterns. The data arrives continuously, is strongly tied to timestamps, and is usually queried across bounded time ranges rather than as individual documents.

TSDS takes advantage of that predictability.

Instead of continuously writing all incoming telemetry data into one generic storage structure, Elasticsearch starts organizing the data around time windows and lifecycle behavior. Each incoming document is routed to a backing index based on its @timestamp value, and Elasticsearch internally manages multiple backing indices, each responsible for a different timestamp range.

Another important concept inside TSDS is the separation between dimensions and metrics.

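Dimensions are fields that identify the source of a metric stream. For example, fields such as device_name, interface_name, and parameter_name identify which time series a document belongs to, and together with the @timestamp they identify an individual data point within that series. Metrics are the numeric measurements themselves, such as a gauge reading or a counter value.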

Internally, Elasticsearch uses these dimensions to organize and route related metric streams more efficiently. Since telemetry systems continuously generate repeated measurements from the same logical sources over time, TSDS can optimize storage behavior and aggregation patterns much more effectively compared to standard indices.
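As a rough illustration of how dimensions and metrics are declared, the sketch below builds an index template body as a Python dictionary. The index.mode, index.routing_path, time_series_dimension, and time_series_metric names are the documented TSDS settings and mapping parameters; the index pattern, the metric field name "value", and its gauge type are assumptions, and the template has not been validated against a specific Elasticsearch version.

```python
import json

# Dimension fields taken from the example in this post.
DIMENSION_FIELDS = ["device_name", "interface_name", "parameter_name"]


def build_tsds_template(index_pattern: str, metric_field: str) -> dict:
    """Build an index template body that enables time-series mode for a data stream."""
    properties = {"@timestamp": {"type": "date"}}
    for name in DIMENSION_FIELDS:
        # Dimensions identify the series a document belongs to.
        properties[name] = {"type": "keyword", "time_series_dimension": True}
    # The metric is the measured value; "gauge" is assumed here.
    properties[metric_field] = {"type": "double", "time_series_metric": "gauge"}
    return {
        "index_patterns": [index_pattern],
        "data_stream": {},
        "template": {
            "settings": {
                "index.mode": "time_series",
                "index.routing_path": list(DIMENSION_FIELDS),
            },
            "mappings": {"properties": properties},
        },
    }


# Hypothetical pattern and metric field name.
template = build_tsds_template("telemetry-metrics-*", "value")
print(json.dumps(template, indent=2))
```

The important part is the split: three keyword fields marked as dimensions, one numeric field marked as a metric, and the index switched into time_series mode.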

At that point, Elasticsearch is no longer simply storing JSON documents. It starts behaving like a storage engine specifically optimized for representing time-oriented systems efficiently at scale.


How TSDS Works Internally

The most interesting part about TSDS is not the configuration itself, but how Elasticsearch internally changes its behavior once it recognizes that the workload is time-series in nature.

At the center of TSDS is the @timestamp field. Unlike normal indices where timestamps are usually treated as just another searchable field, TSDS uses the timestamp as a core routing input. Every incoming document is evaluated based on its @timestamp value, and Elasticsearch automatically determines which backing index should receive that document.

This is where backing indices become important.

A TSDS data stream is not a single physical index. Internally, Elasticsearch manages multiple hidden backing indices behind the data stream, where each backing index is responsible for a particular time range. As time progresses, Elasticsearch performs rollovers and newer backing indices are created for newer timestamp windows.

Because of this architecture, Elasticsearch no longer treats the entire telemetry dataset as one continuously growing storage structure. The data becomes naturally partitioned by time itself.
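The time-based partitioning described above can be modeled in a few lines of Python. This is a simplified sketch: the backing index names and the one-day windows are hypothetical, and each window is treated as start-inclusive and end-exclusive, which mirrors how the index.time_series.start_time and index.time_series.end_time settings are documented.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Hypothetical backing indices, each owning a [start, end) time window.
BACKING_INDICES = [
    ("backing-000001", datetime(2026, 4, 30, tzinfo=UTC), datetime(2026, 5, 1, tzinfo=UTC)),
    ("backing-000002", datetime(2026, 5, 1, tzinfo=UTC), datetime(2026, 5, 2, tzinfo=UTC)),
]


def pick_backing_index(timestamp, indices=BACKING_INDICES):
    """Return the backing index whose window contains the timestamp."""
    for name, start, end in indices:
        if start <= timestamp < end:
            return name
    # No window covers this timestamp, so the document cannot be accepted.
    raise ValueError(f"no backing index accepts {timestamp.isoformat()}")


doc_time = datetime(2026, 5, 1, 10, 15, tzinfo=UTC)
target = pick_backing_index(doc_time)
print(target)
```

A document's destination depends on its own timestamp, not on when it happens to arrive, and a timestamp that falls outside every window has nowhere to go.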

Another important optimization happens through dimensions.

In TSDS, dimensions act as stable identifiers for a metric stream. For example, if metrics are continuously generated from the same device, interface, and parameter combination, Elasticsearch understands that these fields belong to the same logical time-series pattern rather than unrelated documents.

Consider a document like this:

device_name = edge-router-01
interface_name = ge-0/0/0
parameter_name = cpu_usage
@timestamp = 2026-05-01T10:15:00Z

Internally, Elasticsearch uses the dimensions together with the timestamp information to organize and route related metric streams more efficiently. This improves aggregation locality, reduces unnecessary storage overhead, and makes telemetry-oriented queries significantly more efficient compared to standard indices.
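To show why stable dimensions help with locality, here is a conceptual Python sketch that derives a series key from the dimension values and uses it to choose a shard. It is not Elasticsearch's internal encoding or routing algorithm; the SHA-256 hashing, the 16-character key, and both function names are assumptions made for illustration.

```python
import hashlib

DIMENSIONS = ("device_name", "interface_name", "parameter_name")


def series_key(doc: dict) -> str:
    """Derive a stable key from dimension values only; the timestamp is ignored."""
    canonical = "|".join(f"{name}={doc[name]}" for name in sorted(DIMENSIONS))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def shard_for(doc: dict, number_of_shards: int) -> int:
    """Map a series to a shard so all of its points land together."""
    return int(series_key(doc), 16) % number_of_shards


first = {
    "device_name": "edge-router-01",
    "interface_name": "ge-0/0/0",
    "parameter_name": "cpu_usage",
    "@timestamp": "2026-05-01T10:15:00Z",
}
# Same source, one minute later.
later = dict(first, **{"@timestamp": "2026-05-01T10:16:00Z"})

print(series_key(first) == series_key(later))
print(shard_for(first, 4) == shard_for(later, 4))
```

Both documents resolve to the same series key and the same shard, so points from one metric stream stay physically close together, which is what makes aggregations over a single series cheaper.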

The combination of timestamp-aware routing, backing indices, and dimension-oriented organization is what allows TSDS to optimize aggressively for observability and telemetry workloads.

And this optimization becomes increasingly valuable as historical data starts growing over time. Because at large scale, the challenge is no longer simply ingesting telemetry data. The real challenge becomes how efficiently the platform can retain, lifecycle-manage, aggregate, and query months of historical metrics without allowing infrastructure cost and operational complexity to grow uncontrollably.


Why TSDS Should Be Introduced Early

One of the biggest mistakes teams make with time-series architecture is assuming they can postpone TSDS migration until later.

At smaller scale, standard indices usually work without major visible issues. Dashboards load correctly, ingestion pipelines remain stable, and operational pressure feels manageable. Because of that, many systems continue building on top of standard indices for far longer than they probably should.

But time-series data grows much faster than most teams expect.

A telemetry platform ingesting hundreds of gigabytes to nearly a terabyte of metrics data daily can accumulate massive historical datasets within a very short period of time. And once that happens, migration stops being a simple architectural improvement and starts becoming a serious operational challenge.

This is something I strongly want to emphasize from experience:

If you are planning to move into TSDS, do it today. Or at least do it before your historical data grows beyond a manageable size.

Because once historical telemetry data becomes extremely large, the complexity changes completely.

For present and future ingestion workflows, TSDS integration is usually smooth. Incoming data naturally follows the expected timestamp behavior, backing index lifecycle, and routing patterns. Operationally, that part is relatively straightforward.

The real complexity starts when historical data enters the picture.

Migrating historical standard indices into TSDS is fundamentally different from handling live ingestion. At that stage, you are no longer simply moving documents between indices. You are dealing with timestamp-bound routing, rollover coordination, backing index constraints, lifecycle timing, and large-scale reindex behavior simultaneously.

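For example, each backing index only accepts documents whose @timestamp falls inside its own time range. Once rollover happens, the new write index covers only newer timestamps, while historical documents still belong to older time windows, and a document whose timestamp matches no backing index's range is rejected. That single architectural detail alone can create unexpected migration challenges if the system is not planned carefully.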

And the larger the historical dataset becomes, the harder this problem gets operationally.
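A quick estimate shows how the size of the backlog translates into migration time. The sketch below is plain arithmetic; the 80 TB backlog and the sustained 200 MB/s reindex throughput are assumed figures, not measurements from a real cluster.

```python
def reindex_days(historical_tb: float, throughput_mb_per_s: float) -> float:
    """Estimate how many days a full reindex takes at a sustained throughput."""
    total_mb = historical_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = total_mb / throughput_mb_per_s
    return seconds / 86_400  # seconds per day


# Assumed values: 80 TB of history at a sustained 200 MB/s.
days = reindex_days(80, 200)
print(f"{days:.1f} days")
```

Under these assumptions the backlog alone needs several days of uninterrupted reindexing, during which live ingestion keeps adding to it.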

Another thing many teams underestimate is that hardware scaling alone does not fully solve the problem. Increasing CPU, RAM, or storage capacity may temporarily improve throughput, but it does not fundamentally change how Elasticsearch internally handles routing behavior, lifecycle execution, segment management, or historical retention complexity.

At large scale, architecture decisions matter more than raw hardware.

This is why TSDS should be treated as an early architectural decision rather than a late-stage optimization task. Because once telemetry retention grows beyond a certain point, migration complexity, operational risk, infrastructure cost, and lifecycle overhead all start increasing together very quickly.


Conclusion

Time-series workloads change the way storage systems need to behave internally.

At smaller scale, standard Elasticsearch indices are usually sufficient. But as telemetry systems continuously generate metrics over long periods of time, the architecture challenges become very different from normal application workloads. Storage growth, retention strategy, lifecycle management, and long-term operational scalability slowly become more important than simply indexing documents quickly.

This is exactly why Elasticsearch introduced Time Series Data Streams (TSDS).

TSDS is not just another index type, and it is not some magical compression layer added on top of Elasticsearch. It is Elasticsearch recognizing that time-series workloads follow highly predictable patterns, and once the storage engine understands those patterns, it can optimize much more efficiently around routing, storage organization, and long-term retention behavior.

More importantly, TSDS should not be treated as a late-stage optimization task.

If there is one thing I would strongly recommend from experience, it is this:

If you are planning to move into TSDS, do it as early as possible.

Because integrating TSDS for present and future ingestion is relatively straightforward. The real complexity starts when massive amounts of historical telemetry data already exist and migration becomes operationally difficult.

In the upcoming blogs, I'll go deeper into the practical side of this journey: downsampling, historical reindexing, rollover behavior, migration strategies, and the production-scale challenges that appear once historical data enters the picture.

But before solving those operational problems, understanding how TSDS works internally is the most important foundation. Because once you understand the architecture, many of Elasticsearch's behaviors start making much more sense.


🔗 Connect with Me

📖 Blog by Naresh B. A.

👨‍💻 Building AI & ML Systems | Backend-Focused Full Stack

🌐 Portfolio: [Naresh B A]

📫 Let's connect on [LinkedIn] | GitHub: [Naresh B A]

Thanks for spending your precious time reading this. It's my personal take on a tech topic, and I really appreciate you being here. ❤️
