NARESH

Posted on Jun 8

What Actually Happens Inside Elasticsearch TSDS During Live Ingestion

#elasticsearch #systemdesign #distributedsystems #architecture

Most TSDS articles focus only on the setup part.
Create an ILM policy. Create an index template. Create a data stream. Insert documents. Done.

But once telemetry platforms start ingesting hundreds of gigabytes or even terabytes of data continuously, the real challenge is no longer configuration. The real challenge becomes understanding what Elasticsearch is actually doing internally while handling live time-series ingestion at scale.

The official Elasticsearch documentation already explains the APIs and configuration flow very well. Instead of repeating that, this blog focuses on the practical side of TSDS from real implementation experience: how live ingestion behaves internally, how rollover actually works, how backing indices evolve over time, and how ILM and downsampling interact with the ingestion pipeline in production systems.

We will also discuss two common approaches used in time-series architectures. One is the modern TSDS-native approach where Elasticsearch automatically manages backing indices and lifecycle behavior internally. The other is the operational approach where systems continue using date-based index patterns due to existing production constraints and migration requirements.

Most importantly, this blog focuses only on the "happy path" of TSDS - present and future ingestion where incoming telemetry naturally aligns with Elasticsearch's expected time windows and lifecycle behavior.

Understanding this flow first is important before dealing with the much harder problem: historical TSDS migration.

Two Common Approaches For Time-Series Ingestion

Before going deeper into TSDS internals, it is important to understand that not every telemetry platform follows the same ingestion architecture.

In most systems, there are usually two common approaches for handling time-series ingestion inside Elasticsearch.

The first approach is the more modern TSDS-native model where applications continuously write into a common data stream such as:
collector-metrics

In this architecture, Elasticsearch internally manages the backing indices, rollover lifecycle, timestamp windows, and write routing automatically. The ingestion pipeline simply keeps sending live telemetry while Elasticsearch handles the underlying storage organization in the background.

The second approach is more operationally driven and is commonly seen in already existing large-scale production systems where indices follow date-based naming patterns such as:
collector-metrics-2026-05-21

At first glance, this may look like an anti-pattern compared to modern TSDS architectures. But in real production environments, migration constraints, existing pipelines, retention workflows, and operational dependencies sometimes make this approach necessary.

In our case, the platform was already heavily dependent on date-based standard indices before TSDS migration started. Because of that, maintaining a similar ingestion structure during migration became operationally safer than redesigning the entire ingestion architecture at once.

This blog primarily focuses on the present and future ingestion path where live telemetry continuously flows into TSDS under normal operating conditions. Historical migration behaves very differently once older timestamps start interacting with rollover boundaries and backing index time windows, which we will cover separately in the next blog.

Before TSDS: Understanding The Ingestion Pipeline

One important thing to understand is that TSDS only solves the storage and lifecycle side of the problem. It does not replace the ingestion pipeline itself.

In a real telemetry platform, data usually flows through multiple stages before it finally reaches Elasticsearch.

A simplified ingestion flow usually looks something like this: producers, then a queue or broker, then worker services, then Elasticsearch.

The producers continuously generate telemetry metrics, operational statistics, or monitoring events. These messages are then pushed into a queue or broker system where worker services consume them asynchronously and perform bulk ingestion into Elasticsearch.

Bulk ingestion matters because telemetry systems are usually append-heavy workloads. Writing documents one by one becomes inefficient very quickly as ingestion volume keeps increasing.

This is where Elasticsearch performs extremely well.

Using the Bulk API, workers can efficiently batch thousands of telemetry documents together and push them into TSDS continuously. From the application side, the workflow looks relatively straightforward. But internally, Elasticsearch is simultaneously handling routing decisions, backing index selection, segment creation, refresh cycles, and lifecycle coordination in the background.
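As a rough illustration of what a worker sends, here is a minimal sketch that builds a Bulk API request body for a data stream. The data stream name metrics-prod and the field names come from later in this post; the second document's values are made up. It only builds the newline-delimited JSON body and does not talk to a cluster.

```python
import json

def build_bulk_body(data_stream, docs):
    """Return an NDJSON Bulk API body that appends docs to a data stream."""
    lines = []
    for doc in docs:
        # Data streams are append-only, so each item uses the "create" action.
        lines.append(json.dumps({"create": {"_index": data_stream}}))
        lines.append(json.dumps(doc))
    # The Bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"

docs = [
    {"@timestamp": "2026-05-21T10:15:00Z", "device_name": "edge-router-01",
     "interface_name": "ge-0/0/0", "parameter_name": "cpu_usage", "value": 42.7},
    # Second sample is invented for illustration.
    {"@timestamp": "2026-05-21T10:20:00Z", "device_name": "edge-router-01",
     "interface_name": "ge-0/0/0", "parameter_name": "cpu_usage", "value": 43.1},
]

body = build_bulk_body("metrics-prod", docs)
print(body)
```

A worker would POST this body to the _bulk endpoint with the Content-Type header set to application/x-ndjson. Real workers usually batch thousands of documents per request and inspect the per-item results in the response, which this sketch leaves out.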

And this is exactly where TSDS starts becoming interesting.

What Makes TSDS Different From Standard Indices

At a high level, TSDS may look similar to a normal Elasticsearch index because applications still send JSON documents through the same ingestion APIs. But internally, the behavior changes significantly once Elasticsearch recognizes that the workload is time-series in nature.

In a normal index, Elasticsearch mainly treats incoming documents as generic records. The system focuses on indexing, searching, and distributing documents efficiently across shards, but it does not deeply optimize around time-based behavior.

Once a data stream is configured for time-series mode, Elasticsearch starts organizing ingestion around timestamps, dimensions, backing indices, and lifecycle-aware storage management.

This becomes important because telemetry workloads follow highly predictable patterns:

data arrives continuously
documents are append-heavy
timestamps mostly move forward
historical queries are aggregation-heavy
retention behavior changes over time

Instead of treating telemetry like one continuously growing generic index, Elasticsearch partitions the data across multiple backing indices based on time windows. Incoming documents are routed to a backing index using their @timestamp, while dimensions identify which time series each document belongs to and help Elasticsearch keep related metric streams together internally.

Certain fields are configured as dimensions so Elasticsearch can logically group related telemetry streams together. But dimensions should represent stable identifiers rather than every field in the document because excessive dimensions can increase cardinality and storage overhead significantly.
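To make the idea of dimensions concrete, here is a conceptual sketch in plain Python. It is not how Elasticsearch computes its internal time series identifier; it only shows that every distinct combination of dimension values becomes its own series, which is why adding high-cardinality fields as dimensions multiplies the number of series. The three dimension fields are an assumption based on the sample document used later in this post, and the sample values are made up.

```python
from collections import defaultdict

# Assumed dimension fields (stable identifiers, not measurements).
DIMENSIONS = ("device_name", "interface_name", "parameter_name")

def series_key(doc):
    """Build the tuple of dimension values that identifies one time series."""
    return tuple(doc[name] for name in DIMENSIONS)

docs = [
    {"@timestamp": "2026-05-21T10:15:00Z", "device_name": "edge-router-01",
     "interface_name": "ge-0/0/0", "parameter_name": "cpu_usage", "value": 42.7},
    {"@timestamp": "2026-05-21T10:20:00Z", "device_name": "edge-router-01",
     "interface_name": "ge-0/0/0", "parameter_name": "cpu_usage", "value": 43.1},
    {"@timestamp": "2026-05-21T10:15:00Z", "device_name": "edge-router-01",
     "interface_name": "ge-0/0/0", "parameter_name": "mem_usage", "value": 61.0},
]

# Group samples by their dimension values.
series = defaultdict(list)
for doc in docs:
    series[series_key(doc)].append((doc["@timestamp"], doc["value"]))

for key, points in series.items():
    print(key, len(points))
```

Here the three documents collapse into two series, because the two cpu_usage samples share the same dimension values.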

This is the point where Elasticsearch slowly stops behaving like a generic document store and starts behaving more like a specialized telemetry storage engine optimized for long-term time-series workloads.

Creating The TSDS Architecture

Once the ingestion pipeline is ready, the next step is building the actual TSDS architecture inside Elasticsearch. At a high level, the setup usually involves four major components:

ILM Policy
Index Template
Data Stream
Live Ingestion Pipeline

The important thing to understand is that TSDS itself is not just a single index. It is a combination of lifecycle management, timestamp-aware routing, backing indices, and storage organization working together internally.

This is also where many engineers get confused while reading the official documentation because the setup steps look simple, but each configuration changes Elasticsearch's internal behavior significantly.

In our case, the ingestion flow was designed around continuous telemetry ingestion where workers consume metrics in bulk and continuously push them into Elasticsearch. The responsibility of Elasticsearch then becomes:

deciding which backing index should receive the document
handling rollover automatically
managing lifecycle transitions
coordinating downsampling
and organizing long-term storage efficiently

To make all of this work correctly, Elasticsearch needs a few foundational configurations first. The first and most important one is the ILM policy.

Understanding ILM Policy

Before creating a TSDS data stream, one of the most important things to understand is ILM, which stands for Index Lifecycle Management.

At a high level, ILM controls how an index behaves throughout its lifetime inside Elasticsearch. It defines:

when rollover should happen
when downsampling should start
when data should move into colder storage tiers
and when old data should eventually be deleted automatically

ILM is not exclusive to TSDS. It works perfectly fine with standard Elasticsearch indices as well, and many large-scale systems already use ILM for retention and storage management long before TSDS migration begins.

But when ILM and TSDS work together, the architecture becomes much more efficient for telemetry workloads.

Assume a platform ingesting nearly 1TB of telemetry data every day. Within a few months, the cluster can easily accumulate tens or even hundreds of terabytes of historical metrics data. Retaining all of that data at raw granularity becomes extremely expensive both operationally and financially.
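A quick back-of-the-envelope calculation shows how fast this adds up. This sketch assumes the 1TB per day figure is on-disk primary data and that nothing is compressed further or downsampled; the replica count is a hypothetical parameter, not something taken from our cluster.

```python
def retained_tb(daily_tb, retention_days, replicas=1):
    """Estimate stored volume: primary data plus replica copies."""
    return daily_tb * retention_days * (1 + replicas)

for days in (30, 90, 180):
    primaries_only = retained_tb(1.0, days, replicas=0)
    with_one_replica = retained_tb(1.0, days, replicas=1)
    print(f"{days} days: {primaries_only:.0f} TB primaries, "
          f"{with_one_replica:.0f} TB with one replica")
```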

ILM solves this by automatically moving data through different lifecycle phases depending on its age and usage pattern.

The first phase is the Hot phase.

This is where newly arriving telemetry data lives. Since the data is queried frequently, Elasticsearch keeps it optimized for fast writes and low-latency queries. Dashboards, alerts, and monitoring systems usually depend heavily on this layer.

As the data becomes older, it moves into the Warm phase.

This is commonly where downsampling begins. For example, telemetry arriving every 5 minutes may later be compacted into larger intervals such as 15 minutes or 30 minutes depending on retention requirements.

Internally, this is not a lightweight operation. Elasticsearch reads the sealed backing index, aggregates each metric per time series and time bucket, and writes the summarized result into a new downsampled index that replaces the original in the data stream. Aggressive interval jumps can increase computation cost significantly. For example, directly converting 5-minute telemetry into 1-hour buckets is much heavier than gradually compacting the data through smaller intervals. Each later downsampling interval also has to be a multiple of the earlier one.
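Conceptually, downsampling a gauge metric means replacing the raw samples in each time bucket with a small summary. The sketch below is a plain Python illustration of that idea, not Elasticsearch's implementation: timestamps are simplified to minute offsets and the sample values are made up.

```python
from collections import defaultdict

def downsample(samples, bucket_minutes):
    """Summarise (minute_offset, value) samples into fixed-size buckets.

    Each bucket keeps min, max, sum and value_count, which is roughly the
    kind of summary a downsampled gauge metric retains.
    """
    buckets = defaultdict(list)
    for minute, value in samples:
        bucket_start = (minute // bucket_minutes) * bucket_minutes
        buckets[bucket_start].append(value)
    return {
        start: {
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "value_count": len(values),
        }
        for start, values in sorted(buckets.items())
    }

# Six 5-minute samples covering half an hour.
samples = [(0, 40.0), (5, 42.0), (10, 44.0), (15, 50.0), (20, 48.0), (25, 46.0)]
result = downsample(samples, 15)
for start, summary in result.items():
    print(start, summary)
```

Six raw documents become two summary documents here. Averages can still be computed from sum and value_count, but the individual 5-minute readings are gone, which is the tradeoff downsampling makes.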

After Warm comes the Cold phase.

At this stage, the data is queried much less frequently, so Elasticsearch prioritizes storage efficiency over query performance. Query latency becomes higher compared to Hot storage, but operational cost becomes significantly lower.

Then comes the Frozen phase.

This phase is usually associated with snapshot-backed object storage systems such as:

AWS S3
Google Cloud Storage (GCS)
Azure Blob Storage

Instead of keeping the full index on expensive local cluster storage, Elasticsearch keeps the data in a snapshot repository on cheaper object storage and mounts it as a searchable snapshot. The data still exists, but queries may require partial mounting or retrieval from snapshot-backed storage, which naturally increases latency.

Finally, there is the Delete phase.

This is where Elasticsearch automatically removes old indices once the configured retention period expires. Without ILM, teams often manage this process manually. With ILM, retention becomes automated and lifecycle-aware.
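Put together, the phases above could be expressed as an ILM policy body roughly like the following. This is a minimal sketch, not our production policy: every age, size, and interval is a hypothetical value chosen for illustration, and the frozen phase is left out because it needs a snapshot repository. The small helpers check that each later downsampling interval is a whole multiple of the earlier one.

```python
import json

# Hypothetical thresholds; tune them to your own retention requirements.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_primary_shard_size": "50gb",
                        "max_age": "1d",
                    }
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {"downsample": {"fixed_interval": "15m"}},
            },
            "cold": {
                "min_age": "30d",
                "actions": {"downsample": {"fixed_interval": "1h"}},
            },
            "delete": {"min_age": "180d", "actions": {"delete": {}}},
        }
    }
}

UNIT_MINUTES = {"m": 1, "h": 60, "d": 1440}

def interval_minutes(text):
    """Convert a fixed interval such as '15m' or '1h' into minutes."""
    return int(text[:-1]) * UNIT_MINUTES[text[-1]]

def downsample_intervals(policy):
    """Collect the downsampling intervals in phase order."""
    phases = policy["policy"]["phases"]
    found = []
    for name in ("hot", "warm", "cold"):
        action = phases.get(name, {}).get("actions", {}).get("downsample")
        if action:
            found.append(action["fixed_interval"])
    return found

def intervals_are_compatible(intervals):
    """Each later interval must be a whole multiple of the one before it."""
    minutes = [interval_minutes(i) for i in intervals]
    return all(later % earlier == 0 for earlier, later in zip(minutes, minutes[1:]))

intervals = downsample_intervals(ilm_policy)
print(json.dumps(ilm_policy, indent=2))
print(intervals, intervals_are_compatible(intervals))
```

The dictionary is what would be sent as the JSON body when creating the policy through the ILM policy API.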

At large scale, this entire lifecycle system becomes part of the architecture itself rather than just a storage optimization feature.

Creating The Index Template

Once the ILM policy is ready, the next step is creating the index template.

The template is one of the most important parts of the TSDS architecture because this is where Elasticsearch learns how the incoming telemetry data should behave internally.

At a high level, the template defines:

which index patterns belong to the data stream
which field acts as the timestamp
which fields are dimensions
how metrics should be stored
how rollover and lifecycle behavior should apply to future backing indices

This is also where TSDS starts becoming different from normal indices.

In a standard index, Elasticsearch mostly stores documents as generic JSON records. But once the template is configured for time-series mode, Elasticsearch starts treating incoming data as part of a continuously evolving telemetry stream.

A simplified template usually contains configurations like:

index.mode: time_series
index.routing_path
lifecycle policy attachment
timestamp mappings
metric mappings
dimension mappings
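As a sketch, an index template carrying those configurations could look like the body below. The index pattern, the policy name metrics-lifecycle, the priority, and the choice of dimensions are assumptions for illustration, built around the sample telemetry document used later in this post. The helper only checks that every routing path field is actually mapped as a dimension.

```python
import json

index_template = {
    "index_patterns": ["metrics-prod*"],      # assumed pattern
    "data_stream": {},                        # marks this as a data stream template
    "priority": 500,                          # assumed priority
    "template": {
        "settings": {
            "index.mode": "time_series",
            "index.routing_path": ["device_name", "interface_name", "parameter_name"],
            "index.lifecycle.name": "metrics-lifecycle",  # hypothetical policy name
        },
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},
                "device_name": {"type": "keyword", "time_series_dimension": True},
                "interface_name": {"type": "keyword", "time_series_dimension": True},
                "parameter_name": {"type": "keyword", "time_series_dimension": True},
                "value": {"type": "double", "time_series_metric": "gauge"},
            }
        },
    },
}

def dimension_fields(template):
    """Return the names of all fields mapped as dimensions."""
    props = template["template"]["mappings"]["properties"]
    return sorted(name for name, cfg in props.items() if cfg.get("time_series_dimension"))

def routing_path_is_valid(template):
    """Every routing path entry should be a mapped dimension field."""
    routing = template["template"]["settings"]["index.routing_path"]
    return set(routing) <= set(dimension_fields(template))

print(json.dumps(index_template, indent=2))
print(dimension_fields(index_template), routing_path_is_valid(index_template))
```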

One important thing to understand here is that the template itself does not create the backing indices immediately. Instead, it acts like a blueprint that Elasticsearch will later use while creating future backing indices automatically during rollover.

This is where rollover becomes extremely important internally.

Assume there is a box that can hold only a limited number of telemetry documents. Once that box reaches a configured threshold such as:

50GB
200 million documents
or a configured age limit

Elasticsearch seals that box and creates a new one automatically.

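Internally, those boxes are the backing indices (names simplified here; real backing index names also include the creation date, for example .ds-metrics-2026.05.21-000001):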

.ds-metrics-000001
.ds-metrics-000002
.ds-metrics-000003

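In a regular data stream, only the newest backing index accepts writes. A TSDS adds one nuance: each backing index covers a time range, and documents are routed by @timestamp, so right after rollover the previous backing index can still accept documents whose timestamps fall inside its range. On the happy path, where timestamps keep moving forward, that window closes quickly. The older backing index then stops receiving data, and Elasticsearch routes all new incoming telemetry into the next backing index automatically.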

This entire behavior is controlled using the template and ILM policy working together behind the scenes.

And this is exactly why understanding rollover properly becomes extremely important before dealing with historical migration later on.

Creating The Data Stream

Once the template is ready, the next step is creating the actual data stream.

This is the point where Elasticsearch starts combining all the configurations together:

TSDS mode
ILM policy
rollover behavior
backing index management
timestamp-aware routing

One important thing to understand is that applications do not directly write into backing indices.

Instead, the application always writes into the data stream itself:

metrics-prod

Internally, Elasticsearch automatically decides which backing index should receive the incoming document based on the current writable index and timestamp boundaries.

For example, assume the current active backing index is:

.ds-metrics-prod-000004

All new incoming telemetry data will continuously flow into this backing index until one of the rollover conditions is reached:

max size
max documents
max age
manual rollover trigger

Once the threshold is reached, Elasticsearch seals the current backing index and creates the next writable backing index automatically:

.ds-metrics-prod-000005

After rollover:

000004 stops being the write index and becomes read-only once its time range has passed
000005 becomes the active write index
all future telemetry automatically routes into 000005

The important thing here is that the application itself usually does not know this rollover happened.

From the application perspective, it still writes into the same logical data stream continuously while Elasticsearch manages the underlying storage lifecycle internally.

This abstraction is one of the biggest advantages of data streams because the ingestion pipeline no longer needs to manually create indices, rotate aliases, or manage rollover coordination explicitly.
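For completeness, here is a small sketch of the request that creates the data stream explicitly. It assumes a local, unsecured cluster at http://localhost:9200 and a matching index template already in place; the request is only built, not sent, so the snippet runs without a cluster.

```python
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local, unsecured cluster

def create_data_stream_request(name):
    """Build (but do not send) the request that creates a data stream."""
    return urllib.request.Request(f"{ES_URL}/_data_stream/{name}", method="PUT")

req = create_data_stream_request("metrics-prod")
print(req.get_method(), req.full_url)

# To send it against a reachable cluster:
# urllib.request.urlopen(req)
```

Elasticsearch can also create the data stream automatically the first time a document is indexed into a name that matches a data stream template, so this explicit call is optional in many setups.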

And once ingestion starts continuously flowing through the data stream, Elasticsearch begins building the full lifecycle pipeline in the background through backing indices, segment organization, rollover coordination, and ILM execution automatically.

What Actually Happens During Live Ingestion

Once the data stream becomes active, the ingestion flow feels surprisingly seamless from the application side. Workers continuously send telemetry documents through the Bulk API while Elasticsearch handles the routing and storage behavior internally.

A simplified telemetry document may look something like this:

{
  "@timestamp": "2026-05-21T10:15:00Z",
  "device_name": "edge-router-01",
  "interface_name": "ge-0/0/0",
  "parameter_name": "cpu_usage",
  "value": 42.7
}

From the application perspective, this is simply another JSON document being indexed into the data stream.

Internally, Elasticsearch performs multiple operations before the document is persisted.

The first thing Elasticsearch checks is the @timestamp field because TSDS heavily depends on time-aware routing. Based on the timestamp and the current writable backing index, Elasticsearch determines where the document should be written.

If the active backing index is:

.ds-metrics-prod-000005

then the incoming telemetry automatically gets routed into that backing index.
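The time-aware routing described above can be pictured as a range lookup. Each TSDS backing index carries a start time and an end time (the index.time_series.start_time and index.time_series.end_time settings), and a document goes to the index whose range contains its @timestamp. The sketch below is a simplified model of that lookup with made-up ranges and the simplified index names used in this post.

```python
from datetime import datetime, timezone

def parse(ts):
    """Parse an ISO-8601 UTC timestamp such as 2026-05-21T10:15:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

# Hypothetical backing indices: (name, start_time inclusive, end_time exclusive).
BACKING_INDICES = [
    (".ds-metrics-prod-000004", parse("2026-05-21T06:00:00Z"), parse("2026-05-21T10:00:00Z")),
    (".ds-metrics-prod-000005", parse("2026-05-21T10:00:00Z"), parse("2026-05-21T12:30:00Z")),
]

def select_backing_index(timestamp):
    """Return the backing index whose time range contains the timestamp."""
    ts = parse(timestamp)
    for name, start, end in BACKING_INDICES:
        if start <= ts < end:
            return name
    return None  # no range matches, so the document would be rejected

print(select_backing_index("2026-05-21T10:15:00Z"))
print(select_backing_index("2026-05-21T09:59:00Z"))
print(select_backing_index("2026-05-20T00:00:00Z"))
```

On the happy path almost every document lands in the newest range. The None case is the one that starts to matter with old timestamps, which is a large part of why historical migration behaves so differently.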

At this stage, Elasticsearch also starts organizing the incoming documents through Lucene segments. The data is not immediately merged into one large optimized structure. Instead, smaller immutable segments continuously get created in the background as ingestion keeps happening.

As telemetry volume grows:

more segments get created
background merges start running
segment compaction begins
rollover thresholds get evaluated continuously

All of this happens while ingestion is still actively running.

One important thing to understand is that rollover is not triggered randomly. ILM periodically checks the active backing index (every 10 minutes by default) against configured lifecycle conditions such as:

shard size
document count
index age

Once one of those thresholds is reached, Elasticsearch seals the current backing index and automatically creates the next writable backing index.
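The rollover decision itself is simple to model: the maximum conditions are combined with OR, so reaching any one of them is enough. This is an illustrative sketch, not Elasticsearch code; the class names are hypothetical and the default thresholds are example values.

```python
from dataclasses import dataclass

@dataclass
class IndexStats:
    primary_shard_gb: float
    primary_shard_docs: int
    age_hours: float

@dataclass
class RolloverConditions:
    # Example thresholds only.
    max_primary_shard_gb: float = 50.0
    max_primary_shard_docs: int = 200_000_000
    max_age_hours: float = 24.0

def should_roll_over(stats, conditions):
    """Return True when any single maximum condition has been reached."""
    return (
        stats.primary_shard_gb >= conditions.max_primary_shard_gb
        or stats.primary_shard_docs >= conditions.max_primary_shard_docs
        or stats.age_hours >= conditions.max_age_hours
    )

conditions = RolloverConditions()
print(should_roll_over(IndexStats(12.0, 40_000_000, 6.0), conditions))   # nothing reached yet
print(should_roll_over(IndexStats(51.5, 90_000_000, 6.0), conditions))   # size threshold reached
print(should_roll_over(IndexStats(12.0, 40_000_000, 24.0), conditions))  # age threshold reached
```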

This is why TSDS ingestion usually feels "invisible" during healthy operation. The application keeps writing into the same logical data stream continuously while Elasticsearch silently manages rollover, backing indices, segment organization, and lifecycle execution underneath.

Why Sealed Backing Indices Become Important

One of the biggest architectural advantages of TSDS appears only after rollover happens.

When a backing index reaches its configured threshold, Elasticsearch seals that backing index and creates a new writable backing index for future telemetry ingestion.

At first glance, this may look like simple index rotation. But internally, this changes how Elasticsearch can manage storage much more efficiently.

Once a backing index becomes read-only:

no new telemetry enters that index
Lucene segments inside it stop continuously changing
Elasticsearch can now optimize those segments much more aggressively

This is extremely important because continuously writable indices are expensive to optimize heavily. New documents keep arriving, segments keep getting created, and background merges keep running continuously.

But once rollover seals a backing index, Elasticsearch now knows that the data inside that backing index is stable.

At that point, Elasticsearch can:

merge segments more efficiently
perform downsampling safely
move historical data into colder tiers
snapshot old backing indices
reduce long-term storage overhead

without affecting the current live ingestion pipeline.

This separation is one of the biggest reasons TSDS scales much better for telemetry workloads compared to storing everything inside one continuously growing index.

The current writable backing index focuses on handling live ingestion efficiently, while older sealed backing indices slowly transition into lifecycle optimization workflows through ILM.

The Real Benefit Is Not Just Downsampling

One important thing to understand is that storage optimization in TSDS does not start only after downsampling. The optimization begins much earlier once the data itself is stored as a proper time-series workload.

Even without downsampling, TSDS can already reduce storage usage significantly compared to standard indices.

For example, in our case, a standard index consuming nearly 800GB was reduced to around 550GB simply by migrating into TSDS without any downsampling enabled yet.

The reason is that TSDS internally organizes telemetry data very differently from generic indices. Since Elasticsearch already understands the workload is time-series in nature, it can optimize routing, dimensions, indexing structures, and storage layouts much more efficiently. For example, documents are sorted by time series and timestamp, so similar values sit next to each other and compress better.

After introducing downsampling, the reduction became even more significant:

raw TSDS data: ~550GB
15-minute downsampled data: ~315GB
1-hour downsampled data: ~100GB
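To put those figures in proportion, the relative savings can be worked out directly. The sizes below are the approximate numbers quoted above; the sketch only does the arithmetic.

```python
def reduction_pct(before_gb, after_gb):
    """Percentage of storage saved when going from before_gb to after_gb."""
    return (before_gb - after_gb) / before_gb * 100

# Approximate sizes in GB, as quoted in this post.
standard, tsds_raw, ds_15m, ds_1h = 800, 550, 315, 100

steps = [
    ("standard index to raw TSDS", standard, tsds_raw),
    ("raw TSDS to 15-minute downsampled", tsds_raw, ds_15m),
    ("raw TSDS to 1-hour downsampled", tsds_raw, ds_1h),
    ("standard index to 1-hour downsampled", standard, ds_1h),
]

for label, before, after in steps:
    print(f"{label}: {reduction_pct(before, after):.1f}% smaller")
```

In round numbers: TSDS alone saved a little under a third, 15-minute downsampling removed a bit over 40% of the raw TSDS size, and the 1-hour data is roughly an eighth of the original standard index.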

At scale, this changes infrastructure cost completely.

But these optimizations also come with tradeoffs.

TSDS is heavily optimized for aggregation-heavy telemetry workloads rather than generic search behavior. This works extremely well for dashboards, monitoring systems, observability queries, and historical analytics. But lifecycle design still matters because aggressive downsampling or poorly designed intervals can increase computational pressure significantly during background compaction.

For example, directly converting very high-frequency telemetry into large aggregation windows creates heavy background work because Lucene still needs to merge, compact, and reorganize large volumes of historical segment data internally.

This is why ILM configuration becomes extremely important.

The interval progression should remain balanced. Instead of jumping aggressively between intervals, lifecycle transitions should move gradually so the cluster can compact historical data more efficiently over time.

Another important operational consideration is force merge.

Force merge allows Elasticsearch to compact segments more aggressively after backing indices become stable and read-only. This can improve long-term storage efficiency and reduce query overhead for historical data. But force merge itself is also resource-intensive and should be planned carefully because it can significantly increase CPU, disk I/O, and merge pressure while running.

At large scale, lifecycle management becomes more of a systems-design problem than simply a storage problem. ILM policy design, rollover strategy, downsampling intervals, force merge behavior, and template configuration all directly affect how efficiently the cluster behaves over long retention periods.

And this is exactly why spending more time on ILM and template design early becomes extremely important: once telemetry retention starts growing continuously, changing those architectural decisions later becomes much harder operationally.

Conclusion

TSDS is not just another Elasticsearch feature added for observability platforms. It is Elasticsearch recognizing that telemetry workloads behave very differently from normal application data and optimizing the storage engine around those patterns.

Once live ingestion starts flowing continuously through TSDS, Elasticsearch begins coordinating rollover, backing index management, lifecycle execution, segment organization, and long-term retention automatically in the background. At smaller scale, these internal behaviors are easy to ignore. But once telemetry systems start generating hundreds of gigabytes or even terabytes of data continuously, these architectural decisions become extremely important.

The biggest lesson from practical experience is that TSDS should not be treated as a late-stage optimization task.

The earlier the lifecycle strategy, template design, rollover configuration, and retention architecture are planned correctly, the easier the system becomes to manage operationally over time.

Once historical telemetry grows significantly, the problem changes completely.

And that is exactly what the next blog focuses on.

In the next part, we will go deep into historical TSDS migration, reindexing challenges, rollover failures, time-bound routing behavior, and the operational problems that start appearing once massive historical datasets enter the system.

🔗 Connect with Me

📖 Blog by Naresh B. A.

👨‍💻 Building AI & ML Systems | Backend-Focused Full Stack

🌐 Portfolio: [Naresh B A]

📫 Let's connect on [LinkedIn] | GitHub: [Naresh B A]

Thanks for spending your precious time reading this. It's my personal take on a tech topic, and I really appreciate you being here. ❤️
