DEV Community

NARESH

Historical TSDS Migration At Scale: Lessons Learned From Real Production Data

TL;DR
Historical TSDS migration is very different from normal TSDS ingestion. After multiple failed approaches, the process that worked was:

  • Create the ILM policy (Hot → Warm → Cold → Frozen).
  • Create a TSDS index template with start_time and end_time.
  • Create the data stream.
  • Reindex historical data into the data stream.
  • Remove the start_time and end_time constraints from the template.
  • Monitor source and destination document counts.
  • Once migration reaches ~98–99% completion, trigger rollover manually.
  • Attach the ILM policy only after migration completes.
  • Allow TSDS lifecycle management and downsampling to run normally.

The biggest lesson from this project is simple:
Move present and future data to TSDS as early as possible.
Treat historical migration as a separate problem.
For very large datasets, TSDS migration alone can provide significant storage savings even before downsampling.
Downsampling historical data at scale is possible, but the time and infrastructure cost should be evaluated carefully.

Interested only in the final implementation? Skip directly to "The Migration Strategy That Finally Worked" and come back later for the lessons learned from the failure modes.


The first two blogs in this series focused on understanding TSDS and how it behaves during normal live ingestion. In most cases, that part is relatively straightforward. Documents arrive continuously, rollover happens automatically, ILM executes as expected, and the system behaves exactly as Elasticsearch intends.

Historical migration is where things become interesting.

At first, migrating historical indices into TSDS sounds simple. Elasticsearch provides documentation, APIs, and recommended workflows for moving existing data into time-series data streams. Naturally, I assumed the migration would be mostly a configuration exercise.

I was wrong.

Over the last few months, I spent a significant amount of time experimenting with different migration approaches, validating assumptions, analyzing failures, and testing multiple implementations against production-scale datasets. Some approaches worked perfectly in development environments and completely failed in production. Others technically worked but became operationally impractical once data volume started growing.

This blog is the result of that journey.

Most of this article is not about the final solution. It is about the failure modes that led to the solution. Understanding those failures is important because they explain why certain migration strategies break down at scale and why the final approach was designed the way it was.

If you are only interested in the implementation itself, feel free to jump directly to the migration strategy section. But I would strongly recommend reading the entire blog first. The solution makes much more sense once you understand the problems it was designed to solve.

Most importantly, this is not an official migration guide. It is one possible approach that emerged from real production constraints, large historical datasets, and a considerable amount of trial and error. If you are planning a large-scale TSDS migration, the lessons in this blog may save you a significant amount of time.


The Assumption That Almost Broke Everything

After understanding how TSDS works during live ingestion, my initial assumption was simple:

Historical migration should follow the same process.

Create a TSDS data stream, attach an ILM policy, start reindexing the historical data, and let Elasticsearch handle the rest.

On paper, that sounds perfectly reasonable.

The problem is that historical migration and live ingestion are fundamentally different workflows.

During live ingestion, data arrives continuously in chronological order. Elasticsearch always knows where the document belongs, rollover happens naturally, and lifecycle execution follows the expected flow.

Historical migration is different.

Instead of handling continuously arriving data, you are replaying old data into a system that was primarily designed for forward-moving time-series ingestion.

That single difference changes everything.

Time-bound routing becomes important. Rollover behavior starts affecting the migration process. Lifecycle execution can interfere with historical data movement. And configurations that work perfectly during live ingestion can create unexpected problems during migration.

The biggest mistake I made at the beginning was treating historical migration as a simple extension of the live ingestion workflow.

It is not.

Historical migration is a separate problem with its own constraints, and understanding those constraints is the key to building a migration strategy that actually works.


Understanding TSDS Time Bounds

One of the most important concepts to understand during historical TSDS migration is the time boundary associated with a data stream.

Every TSDS backing index operates within a specific time window defined by two settings:

  • index.time_series.start_time
  • index.time_series.end_time

When a document arrives, Elasticsearch evaluates its @timestamp and routes it to the backing index whose time window contains that value. If no backing index covers the timestamp, the document is rejected.

This works extremely well for live ingestion because telemetry data naturally moves forward in time.

To support delayed events, Elasticsearch also provides:

index.look_back_time

This setting allows a newly created TSDS to accept older timestamps when the first backing index is created. However, the maximum supported value is only 7 days, with a default of 2 hours.

For most observability workloads, that is perfectly reasonable. A few minutes, hours, or even days of delayed telemetry is normal.

Historical migration is different.

In our case, we were not dealing with data that was a few hours or days old. We were dealing with months of historical telemetry that already existed inside standard indices.

At that point, increasing look_back_time is no longer a solution because the timestamps fall far outside the range TSDS is designed to handle automatically.

This is where historical migration stops being a simple reindex operation.

Instead, it becomes a problem of managing time boundaries, backing indices, rollover behavior, and lifecycle execution in a controlled way.

Once I understood that, many of the failures I was seeing suddenly started making sense.
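The acceptance rule described in this section can be sketched in a few lines of Python. This is an illustration only, not Elasticsearch code: the function names are hypothetical, and it assumes the documented behavior that start_time is inclusive, end_time is exclusive, and index.look_back_time cannot exceed 7 days.

```python
from datetime import datetime, timedelta, timezone

# Documented ceiling for index.look_back_time.
MAX_LOOK_BACK = timedelta(days=7)


def index_accepts(ts, start_time, end_time):
    """True when ts falls inside the backing index window [start_time, end_time)."""
    return start_time <= ts < end_time


def reachable_with_look_back(ts, stream_created_at):
    """True when even the maximum look_back_time would still cover ts."""
    return stream_created_at - ts <= MAX_LOOK_BACK
```

A document from two months before the stream was created fails the second check, which is why the migration needs explicit start_time and end_time values rather than a larger look_back_time.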


The First Migration Attempt

Once I understood the time-bound nature of TSDS, the first migration strategy seemed straightforward.

The goal was simple: move historical telemetry from standard indices into TSDS and let Elasticsearch handle lifecycle management automatically.

The migration started successfully. Historical documents were being transferred, the data stream was accepting data, and everything initially looked healthy.

Then the first unexpected behavior appeared.

The historical dataset contained more than a billion documents for a single day. As the migration progressed, Elasticsearch eventually reached its rollover threshold and created a new backing index.

Under normal live ingestion, this is exactly what should happen.

The problem was that historical migration is not live ingestion.

The incoming documents still belonged to the original historical time window. Elasticsearch evaluated the timestamps and attempted to route them according to the time boundaries associated with the backing indices.

But the original backing index had already rolled over and was no longer accepting writes.

In other words, the data still belonged to the first backing index, but Elasticsearch had already moved on to the next one.

At that point, the migration started fighting against the TSDS lifecycle itself.

What made this particularly confusing was that nothing was actually wrong with Elasticsearch.

The system was behaving exactly as designed.

The real problem was my assumption that historical replay would behave like live ingestion.

It doesn't.

That was the moment I realized the challenge was no longer moving data from one index to another. The real challenge was controlling rollover behavior while historical data was still being replayed into the system.

That realization led to the first major redesign of the migration workflow.


Why Everything Went Into 000001

After understanding the rollover problem, the next question became obvious:

Why not simply prevent rollover until the historical migration is finished?

At first, this looked like a much better approach.

Instead of allowing Elasticsearch to create multiple backing indices during migration, all historical documents would be transferred into the first backing index:

.ds-<stream>-000001

Only after the migration completed would rollover be triggered and the normal TSDS lifecycle allowed to continue.

This solved the routing problem completely.

Historical documents no longer needed to compete with rollover boundaries. Every document belonging to the migration window could be written into the same backing index without Elasticsearch attempting to redirect it elsewhere.

The migration became stable.

But that stability came with a tradeoff.

Everything was now concentrated inside a single backing index.

For normal ingestion workloads, this is usually not a concern because data arrives gradually over time and rollover continuously distributes data across multiple backing indices.

Historical migration behaves differently.

A single backing index can end up containing hundreds of gigabytes or even terabytes of telemetry data.

That becomes extremely important once downsampling begins.

When Elasticsearch converts 5-minute telemetry into larger intervals such as 15 minutes or 1 hour, the operation is not happening magically in the background. Lucene still needs to read, aggregate, compact, and write large volumes of data.

The larger the backing index becomes, the more work Elasticsearch must perform against the same shards holding that historical data.

In our environment, downsampling hundreds of gigabytes of historical telemetry was no longer measured in minutes or hours.

It was measured in days.

At that point, I realized I had solved one problem by intentionally creating another.

The migration strategy was now technically correct.

The new challenge was making it operationally practical at scale.


Why More CPU And RAM Didn't Solve It

One of the first ideas we explored was adding more resources.

The logic seemed straightforward. If downsampling was taking too long, then the cluster probably needed more CPU or more memory.

After all, Elasticsearch is a distributed system. It is natural to assume that scaling the infrastructure will solve the problem.

Unfortunately, the bottleneck was not that simple.

By this stage of the migration, all historical documents had already been transferred into the first backing index. The migration was stable, but it created a new challenge: a massive amount of data now lived inside a single backing index.

When downsampling started, Elasticsearch needed to read that historical data, aggregate it into larger time buckets, and write the resulting documents into a new downsampled index. This work happens against the shards containing the source data and is both CPU and memory intensive.

In our environment, a single day could contain close to a terabyte of telemetry data.

The historical data was already sitting on the nodes that owned those shards. During downsampling, those same nodes were also responsible for performing the aggregation work and generating the new downsampled data. As resource utilization increased, operations slowed down, retried, and took significantly longer to complete.

My initial assumption was that horizontal scaling would solve the problem.

But horizontal scaling helps when workloads can be distributed across additional nodes. Historical downsampling is different. The source data already exists on specific shards, and those shards still need to perform most of the work. Adding more nodes does not automatically make a large historical backing index process faster.

The next idea was vertical scaling.

In theory, more CPU and memory would allow Elasticsearch to process the workload faster. But in practice, we decided not to pursue that approach because the expected benefit did not justify the additional infrastructure cost.

Even with significantly larger nodes, Elasticsearch would still need to read, aggregate, compact, and write the same amount of historical data. The work does not disappear.

The concern was that a task taking several weeks might become somewhat faster, but not fast enough to fundamentally change the migration strategy.

This is where the problem stopped being purely technical.

The original goal of introducing TSDS was to reduce storage costs and improve long-term retention efficiency. If solving historical downsampling requires substantial temporary infrastructure upgrades, the economics start becoming questionable.

At that point, the question was no longer:

"Can Elasticsearch downsample this data?"

The answer was clearly yes.

The real question became:

"Is the time and infrastructure cost required to downsample historical data worth the storage savings gained afterward?"

That tradeoff ultimately shaped the final migration strategy.


Other Approaches We Explored

After realizing that simply adding more resources would not fundamentally solve the problem, the next step was exploring alternative migration strategies.

The first idea was to distribute the historical data across multiple backing indices instead of concentrating everything inside 000001.

The reasoning was simple.

If a single backing index was becoming the bottleneck for downsampling, then spreading the historical data across multiple backing indices should distribute the workload and reduce pressure on any single node.

One experiment involved splitting a day's historical data into multiple time windows and attempting to route each window into a different backing index.

For example:

  • 00:00–04:00 → 000001
  • 04:00–08:00 → 000002
  • 08:00–12:00 → 000003

and so on.

On paper, this looked like a good solution. Instead of one backing index containing an entire day's worth of telemetry, the workload would be distributed across multiple backing indices, allowing downsampling to happen more evenly.

The problem was that TSDS does not work that way.

The first backing index can be created with custom index.time_series.start_time and index.time_series.end_time values. But once rollover creates additional backing indices, Elasticsearch manages those time boundaries internally.

Historical documents still need to satisfy the timestamp constraints associated with the backing index receiving them.

As a result, historical data could not simply be redirected into arbitrary backing indices to spread the workload.

The second idea was to move away from the Reindex API entirely and use a Scroll API-based migration.

The workflow looked something like this:

  • Read documents using the Scroll API.
  • Process them in an external service.
  • Insert them back into TSDS through the ingestion pipeline.

At first glance, this appears to provide much more control over the migration process.

In reality, it introduced a completely different set of problems.

The Reindex API performs data movement entirely inside Elasticsearch. A Scroll API-based solution introduces an additional application layer between the source and the destination.

Every document now needs to:

  • Leave Elasticsearch
  • Travel through the application
  • Be serialized and processed
  • Be sent back to Elasticsearch

That introduces additional network overhead, application overhead, and operational complexity.

More importantly, it still does not solve the actual TSDS problem.

Even if the migration logic lived outside Elasticsearch, the destination data stream would still enforce the same timestamp boundaries and routing rules.

In other words, we would be adding complexity without removing the core constraint.

At that point, it became clear that the migration mechanism was never the real bottleneck.

The challenge was understanding how to work with TSDS lifecycle behavior instead of trying to bypass it.

That realization ultimately led to the migration strategy that finally worked.


The Migration Strategy That Finally Worked

After exploring multiple approaches, I eventually stopped trying to work around TSDS and started designing the migration around its internal behavior.

The final solution was not perfect.

In fact, it violates some of the patterns that Elasticsearch would naturally prefer for large-scale time-series workloads. But it was the most reliable approach I found for migrating large historical datasets while still preserving the ability to use TSDS lifecycle management afterward.

Before discussing the implementation, it is important to understand one thing.

This solution was designed specifically for historical migration.

It should not be considered a replacement for normal TSDS ingestion.

Under normal conditions, Elasticsearch expects data to arrive continuously, rollover naturally, and distribute data across backing indices over time. Historical migration breaks those assumptions because months of existing data must be replayed into a system that was originally designed around forward-moving timestamps.

Because of that, some compromises are necessary.

Choosing Control Over Automation

The first design decision was deciding how the migration itself should run.

There were two possible approaches.

The first option was a background job that automatically scans indices and starts migrations continuously.

For example, if the cluster contains:

  • telemetry-2026-01-01
  • telemetry-2026-01-02
  • telemetry-2026-01-03

the job could automatically discover matching indices and start migrating them.

This works reasonably well for small datasets.

The problem appears when the datasets become large.

If multiple migrations complete around the same time, the associated lifecycle operations can also start around the same time. That means multiple indices may begin downsampling simultaneously.

At that point, CPU, memory, and disk utilization can spike dramatically.

If Elasticsearch is also being used as a source of truth for production workloads, that becomes a risk.

For this reason, I strongly preferred controlled execution instead of fully automated execution.

The second option was exposing the migration through an API.

This is the approach I ultimately chose.

Instead of automatically processing every index, the migration is triggered intentionally through an API request. The payload contains the information required for a single migration, such as:

  • from_index
  • to_index
  • end_time
  • ilm_policy

This gives complete control over how each historical index is migrated and when lifecycle processing should begin.
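As a rough illustration, the payload could be modeled and validated like this. The four field names come from the list above; the class name, the validation rule, and the example values are assumptions made for this sketch, not the service actually used in this project.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("from_index", "to_index", "end_time", "ilm_policy")


@dataclass(frozen=True)
class MigrationRequest:
    """One migration = one source index, one destination data stream."""
    from_index: str
    to_index: str
    end_time: str
    ilm_policy: str

    @classmethod
    def from_payload(cls, payload):
        # Reject incomplete requests before any cluster call is made.
        missing = [name for name in REQUIRED_FIELDS if not payload.get(name)]
        if missing:
            raise ValueError("missing fields: " + ", ".join(missing))
        return cls(**{name: payload[name] for name in REQUIRED_FIELDS})
```

Rejecting an incomplete payload up front keeps a half-specified migration from ever reaching the cluster.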

The most important parameter in the payload is the end_time.

This value must be chosen carefully because it determines how long Elasticsearch will continue accepting historical documents into the TSDS backing index.

For example, if you are migrating data for April 1st, you should not set the end time to April 1st itself. Instead, you should extend the window and use April 2nd or, preferably, April 3rd.

Using April 3rd is generally safer because it gives Elasticsearch additional time to complete the reindex operation before the TSDS acceptance window closes. April 2nd will usually work as well, but it leaves less room for delays caused by cluster load, retries, or large datasets.

The reason this value is provided per migration request instead of being configured globally is to avoid lifecycle operations piling up at the same time.

For example, imagine every migration uses a static end time such as May 30th. In that case, all migrated indices would become eligible for subsequent lifecycle actions around the same period. Downsampling jobs could then start simultaneously across many indices, creating significant spikes in CPU, memory, and disk utilization.

By supplying the end time in the migration payload, each historical index can progress through its lifecycle independently. This allows downsampling and other lifecycle actions to occur gradually rather than all at once, resulting in much more predictable cluster behavior.
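The April 1st example can be turned into a small helper that derives the window from the day being migrated. This is a sketch under assumptions: the function name, the UTC day boundaries, and the default buffer of two days are illustrative choices, not values prescribed by Elasticsearch.

```python
from datetime import date, datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def migration_window(data_day, buffer_days=2):
    """Return (start_time, end_time) strings for one day of historical data.

    buffer_days=1 ends the window on the following day (April 2nd in the
    example), buffer_days=2 on the day after that (April 3rd).
    """
    if buffer_days < 1:
        raise ValueError("buffer_days must be at least 1 to cover the whole day")
    start = datetime(data_day.year, data_day.month, data_day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=buffer_days)
    return start.strftime(TIMESTAMP_FORMAT), end.strftime(TIMESTAMP_FORMAT)
```

Because the window is computed per source index, indices for different days end up with different end times, which is what keeps their lifecycle actions from lining up.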

Before starting a migration, cluster health, available storage, resource utilization, and ongoing tasks can also be reviewed.

The process becomes slower operationally because someone needs to initiate it, but it becomes significantly safer for production environments.

For large historical migrations, control is usually more valuable than automation.

Step 1: Create The ILM Policy

The first step is creating the lifecycle policy that will eventually manage the migrated data.

A simplified version of the lifecycle looked like this:

  • Hot Phase - rollover after 1 day
  • Warm Phase - downsample from 5-minute telemetry to 15-minute telemetry
  • Cold Phase - downsample from 15-minute telemetry to 1-hour telemetry
  • Frozen Phase - snapshot the data into object storage

The exact intervals can vary depending on business requirements, but the important part is that the lifecycle already exists before migration begins.

Notice that I said the policy should exist.

I did not say it should be attached immediately.

That distinction becomes important later.
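A policy with the four phases listed above might look roughly like the following request body, built here in Python so it can be checked without a cluster. The rollover age and the two downsampling intervals come from the list; the min_age values and the repository name are placeholders I made up for the sketch and need to be replaced with your own retention requirements.

```python
def build_ilm_policy(snapshot_repository, warm_after="2d", cold_after="7d", frozen_after="30d"):
    """Return a request body intended for PUT _ilm/policy/<policy-name>."""
    return {
        "policy": {
            "phases": {
                # Hot: roll over after one day.
                "hot": {"actions": {"rollover": {"max_age": "1d"}}},
                # Warm: 5-minute telemetry becomes 15-minute telemetry.
                "warm": {
                    "min_age": warm_after,  # placeholder value
                    "actions": {"downsample": {"fixed_interval": "15m"}},
                },
                # Cold: 15-minute telemetry becomes 1-hour telemetry.
                "cold": {
                    "min_age": cold_after,  # placeholder value
                    "actions": {"downsample": {"fixed_interval": "1h"}},
                },
                # Frozen: data is served from a snapshot in object storage.
                "frozen": {
                    "min_age": frozen_after,  # placeholder value
                    "actions": {
                        "searchable_snapshot": {"snapshot_repository": snapshot_repository}
                    },
                },
            }
        }
    }
```

Note that the second downsampling interval has to be a multiple of the first, which 1h is for 15m.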

Step 2: Create The Initial TSDS Template

The next step is creating the TSDS template.

This template contains:

  • mappings
  • dimensions
  • data stream configuration
  • lifecycle configuration
  • TSDS settings

Most importantly, the first template contains:

  • index.time_series.start_time
  • index.time_series.end_time

These settings define the historical time window that Elasticsearch is allowed to accept.

Without them, the historical documents would fall outside the acceptable TSDS range and the migration would fail.

The migration end time becomes particularly important.

If the historical data belongs to April 1st, the end time should extend beyond that period so Elasticsearch continues accepting those documents during the migration.

The exact value is flexible, but it must be large enough to allow the migration to complete before the time window closes.
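To make the shape of this first template concrete, here is a minimal sketch of its request body. The function name, the priority value, and the choice to map every dimension as a keyword are assumptions for illustration; metric fields are omitted, and index.lifecycle.name is deliberately left out because the policy is attached only after the migration completes.

```python
def build_migration_template(stream_name, dimensions, start_time, end_time, priority=500):
    """Return a request body intended for PUT _index_template/<template-name>."""
    properties = {"@timestamp": {"type": "date"}}
    for field in dimensions:
        # Each dimension becomes part of the time series identity.
        properties[field] = {"type": "keyword", "time_series_dimension": True}
    return {
        "index_patterns": [stream_name],
        "data_stream": {},
        "priority": priority,  # hypothetical value
        "template": {
            "settings": {
                "index.mode": "time_series",
                "index.routing_path": list(dimensions),
                # Historical window the first backing index will accept.
                "index.time_series.start_time": start_time,
                "index.time_series.end_time": end_time,
            },
            "mappings": {"properties": properties},
        },
    }
```

With a body like this in place, creating the data stream produces a first backing index whose window covers the historical day instead of the current time.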

Step 3: Start The Reindex Operation

Once the template and data stream exist, the migration can begin.

For large datasets, reindexing becomes a major operation by itself.

In my testing, a single historical index containing hundreds of gigabytes of telemetry data could take many hours to complete.

The configuration I found most stable used:

  • slices = 5
  • requests_per_second configured appropriately for the cluster
  • size = 10000

The first two settings help control concurrency and throughput, but the third setting is equally important.

Elasticsearch's reindex API reads the source in scroll batches and holds each batch in memory while it is processed. The default batch size is 1,000 documents, and in most practical scenarios the batch size should not exceed 10,000 documents per request.

This ceiling comes from how Elasticsearch manages scroll batches and heap memory: scroll batch sizes are capped by the source index's index.max_result_window setting, which defaults to 10,000.

For example:

```
POST _reindex
{
  "source": {
    "index": "source-index",
    "size": 10000
  },
  "dest": {
    "index": "destination-index",
    "op_type": "create"
  }
}
```

The op_type of create is required whenever the reindex destination is a data stream.

Using a batch size of 10,000 documents is generally considered the safe upper limit.

If you attempt to push significantly larger batches, such as 15,000 or 20,000 documents per request, Elasticsearch may reject the request or fail due to payload and memory constraints. Depending on the version and cluster configuration, you may encounter errors indicating that the request exceeds allowed limits or that the payload is too large.

For that reason, I kept the batch size at 10,000 documents and relied on slicing and throttling to improve throughput rather than increasing the payload size.

The goal here is not to maximize speed.

The goal is to maintain predictable cluster behavior while the migration is running.

A migration that finishes slightly slower but keeps the cluster healthy is usually preferable to one that aggressively consumes resources and impacts production workloads.
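The three settings from this step can be combined into one request. The sketch below only builds the path and body; it does not call Elasticsearch. The function name and the example throttle value are assumptions, while slices, requests_per_second, and wait_for_completion are the reindex API's own query parameters.

```python
from urllib.parse import urlencode

MAX_BATCH_SIZE = 10_000  # default index.max_result_window on the source index


def build_reindex_request(source_index, dest_stream, requests_per_second,
                          slices=5, batch_size=MAX_BATCH_SIZE):
    """Return (path, body) for a throttled, sliced reindex into a data stream."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    query = urlencode({
        "slices": slices,
        "requests_per_second": requests_per_second,
        # Run as a background task instead of holding the HTTP connection open.
        "wait_for_completion": "false",
    })
    body = {
        "source": {"index": source_index, "size": batch_size},
        # op_type=create is mandatory when the destination is a data stream.
        "dest": {"index": dest_stream, "op_type": "create"},
    }
    return f"/_reindex?{query}", body
```

Raising throughput then means adjusting slices or requests_per_second, never pushing batch_size past the limit.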

Step 4: Remove The Time Boundaries

This is one of the most important parts of the process.

After the migration starts, a second template is created with lower priority.

This template removes the explicit:

  • start_time
  • end_time

configuration.

The reason is simple.

The custom time boundaries are needed only to allow historical documents to enter the first backing index.

Keeping those boundaries permanently can interfere with normal TSDS lifecycle behavior afterward.

Once the historical data is accepted, Elasticsearch should be allowed to resume managing the backing indices normally.
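One way to derive the body of the follow-up template is to copy the first one and drop the two boundary settings. This is a sketch only: the helper name is hypothetical, and it assumes the settings were written as flat dotted keys rather than nested objects.

```python
import copy

TIME_BOUND_SETTINGS = (
    "index.time_series.start_time",
    "index.time_series.end_time",
)


def without_time_bounds(template_body):
    """Return a copy of an index template body with the explicit time bounds removed."""
    body = copy.deepcopy(template_body)  # leave the caller's dict untouched
    settings = body.get("template", {}).get("settings", {})
    for key in TIME_BOUND_SETTINGS:
        settings.pop(key, None)
    return body
```

Everything else in the template (mappings, dimensions, index mode) stays identical, so backing indices created by later rollovers differ only in how their time window is chosen.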

Step 5: Delay ILM Until Migration Completes

This was the biggest lesson learned from the entire project.

My original implementation attached the ILM policy immediately.

That worked fine until the document count became large.

Once the backing index reached rollover conditions, Elasticsearch behaved exactly as it was designed to behave.

It rolled over.

The problem was that the historical migration was still running.

The remaining documents still belonged to the first backing index, but Elasticsearch had already created the second one.

At that point, routing issues started appearing and the migration became unreliable.

The solution was surprisingly simple.

Do not attach the ILM policy at the beginning.

Allow the historical migration to finish first.

Only after the migration completes should rollover and lifecycle execution be enabled.

This prevents Elasticsearch from competing against the migration itself.

Step 6: Validate Using Counts Instead Of Task State

Another lesson came from monitoring reindex tasks.

Initially, I considered using the reindex task status to determine when the migration finished.

The problem is that task status alone is not always sufficient.

Retries, cluster interruptions, or transient failures can temporarily affect task visibility.

Instead, I found document counts to be a more reliable indicator.

The migration continuously compares:

  • source document count
  • destination document count

Once the destination reaches an acceptable threshold compared to the source, the migration is considered complete.

In practice, I found that waiting for roughly 98–99% completion before preparing the rollover process produced more reliable results than relying exclusively on task state.

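Another thing to remember is that TSDS dimensions can also affect document counts. In a TSDS, a document's _id is derived from its dimension values and @timestamp, so source documents that share the same dimensions and timestamp cannot coexist in the destination. If duplicate telemetry already exists in the source data, the destination count will legitimately end up lower than the source count, and by how much depends on the configured dimensions.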

Because of that, count validation should always be interpreted with an understanding of the data model being migrated.
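The completion check itself is simple arithmetic. In this sketch the two counts are assumed to come from count requests against the source index and the destination data stream; the function names and the 0.98 default are illustrative, chosen to match the 98–99% range mentioned above.

```python
def migration_progress(source_count, dest_count):
    """Fraction of source documents that are present in the destination."""
    if source_count <= 0:
        raise ValueError("source_count must be positive")
    return dest_count / source_count


def ready_for_rollover(source_count, dest_count, threshold=0.98):
    """True once the destination holds at least `threshold` of the source documents."""
    return migration_progress(source_count, dest_count) >= threshold
```

A threshold below 100% is intentional: it leaves room for documents that TSDS will never accept as separate entries, such as duplicates of the same dimensions and timestamp.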

Step 7: Trigger Rollover And Attach ILM

Once the migration is validated:

  • Trigger rollover manually.
  • Close the first backing index for writes.
  • Attach the ILM policy.
  • Allow lifecycle execution to begin.
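The first three steps map onto three REST calls, sketched here as plain (method, path, body) tuples. The function name and example names are hypothetical. Using the add index block API for the write block and updating index.lifecycle.name directly on the backing index are assumptions about one reasonable way to do it, not necessarily how this project did it.

```python
def finalize_requests(data_stream, first_backing_index, ilm_policy):
    """Return the ordered requests that hand a migrated stream over to ILM."""
    return [
        # 1. Roll over so new writes go to a fresh backing index.
        ("POST", f"/{data_stream}/_rollover", None),
        # 2. Block further writes to the backing index holding the historical data.
        ("PUT", f"/{first_backing_index}/_block/write", None),
        # 3. Attach the lifecycle policy so ILM starts managing that index.
        ("PUT", f"/{first_backing_index}/_settings", {"index.lifecycle.name": ilm_policy}),
    ]
```

Keeping the calls in a fixed order makes the hand-over easy to audit: nothing is attached to ILM until the rollover has already happened.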

At this point, Elasticsearch can resume behaving like a normal TSDS deployment.

The migrated historical data is now inside TSDS and lifecycle management can take over.

The important tradeoff is that all historical data still resides inside the first backing index.

This is not ideal.

Under normal TSDS operation, data would naturally be distributed across multiple backing indices over time.

But for historical migration, this was the most reliable approach I found.

It solves the routing problem.

It solves the rollover problem.

It preserves lifecycle management.

And most importantly, it allows historical data to enter TSDS successfully without fighting against the internal assumptions that TSDS was designed around.


What I Would Recommend Today

After spending weeks experimenting with different migration approaches, failure modes, lifecycle configurations, and production-scale datasets, my recommendations today are actually very simple.

Recommendation 1: Start Using TSDS For Present And Future Data Immediately

If you are planning to move to TSDS, do not wait.

This is probably the biggest lesson from this entire journey.

For present and future ingestion, TSDS migration is relatively straightforward. Elasticsearch already provides the necessary documentation, APIs, templates, lifecycle policies, and migration paths.

Most of the effort is not in the implementation itself.

The real work is deciding:

  • which fields should be dimensions
  • what your ILM policy should look like
  • how long data should stay in each tier
  • when downsampling should occur

Once those decisions are made, the migration for future data is usually smooth.

More importantly, every day you postpone the migration creates more historical data that must eventually be migrated later.

Historical migration becomes harder as data grows.

Future ingestion does not.

If I were starting from scratch today, the first thing I would do is move all new telemetry workloads to TSDS as early as possible.

Recommendation 2: Treat Historical Migration As A Separate Problem

One mistake many teams make is treating future ingestion and historical migration as the same project.

They are not.

Future ingestion is usually a configuration problem.

Historical migration is an operational problem.

The strategies, risks, and timelines are completely different.

My recommendation is to stop the growth first.

Move all new data into TSDS.

Only after that should you decide what to do with the historical data.

That immediately prevents the historical migration problem from becoming larger every day.

Recommendation 3: Be Careful With Historical Downsampling

This is where my recommendation becomes much more conservative.

If you are dealing with relatively small historical datasets, downsampling is absolutely worth considering.

But once individual historical indices become very large, the economics start changing.

In our environment, some historical indices contained hundreds of gigabytes of telemetry data, and certain days approached nearly a terabyte of data.

At that scale, downsampling is no longer just a storage optimization feature.

It becomes a significant computational workload.

For example, converting 5-minute telemetry into 15-minute intervals may still be practical.

But aggressively pushing large historical datasets into much larger aggregation windows can become extremely time-consuming.

In my case:

  • 5-minute → 15-minute downsampling took multiple days
  • 15-minute → 1-hour downsampling was projected to take several weeks

At that point, the question is no longer whether Elasticsearch can do it.

The answer is yes.

The question becomes whether the time and infrastructure cost are justified.

Recommendation 4: Reindex First, Optimize Later

If preserving historical data is important, my preferred approach is:

  • Convert standard indices into TSDS.
  • Preserve the data.
  • Decide later whether downsampling is actually necessary.

Simply moving from standard indices to TSDS can already produce substantial storage savings.

In our environment, a historical index close to 900GB was reduced to roughly 500GB after migration to TSDS, even before any downsampling was applied.

That reduction alone can justify the migration effort.

Because of that, I would prioritize reindexing first and optimization second.

Storage reduction starts immediately after the TSDS migration.

Downsampling can always be evaluated later.
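For reference, the reduction in the example above works out as follows. This is a minimal sketch using the approximate sizes quoted in this section; `storage_reduction` is a hypothetical helper name introduced only for illustration.

```python
def storage_reduction(before_gb: float, after_gb: float) -> tuple[float, float]:
    """Return (GB saved, percentage saved) for an index before and after migration."""
    if before_gb <= 0:
        raise ValueError("before_gb must be positive")
    saved_gb = before_gb - after_gb
    return saved_gb, 100.0 * saved_gb / before_gb


# Approximate figures from the example: ~900 GB before, ~500 GB after.
saved_gb, saved_pct = storage_reduction(before_gb=900, after_gb=500)
print(f"Saved about {saved_gb:.0f} GB ({saved_pct:.0f}%)")
```

With those rounded inputs, the migration alone removes roughly 400 GB, a bit under half of the original footprint, before any downsampling is considered.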

Recommendation 5: Use Frozen Storage Aggressively

If long-term retention is important, frozen storage is usually a better option than forcing aggressive downsampling across very large historical datasets.

Instead of spending weeks processing old telemetry, consider:

  • migrating the data into TSDS
  • moving older data into the Frozen tier
  • storing snapshots in object storage

The data remains available when needed, while storage costs are significantly lower than when everything is kept on hot or warm Elasticsearch nodes.

Query latency increases, but for historical investigations that is often an acceptable tradeoff.
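As a rough illustration of the frozen-tier part of this, the sketch below builds the body of an ILM policy whose frozen phase mounts data as a searchable snapshot. Only the frozen phase is shown; the hot, warm, and cold phases of a full policy are omitted. The `min_age` value and the repository name `my-object-store-repo` are placeholders, not values from this migration, and the snapshot repository is assumed to already be registered in the cluster.

```python
import json


def build_frozen_phase_policy(min_age: str, snapshot_repository: str) -> dict:
    """Build an ILM policy body containing only a frozen phase."""
    return {
        "policy": {
            "phases": {
                "frozen": {
                    # Age at which an index enters the frozen phase.
                    "min_age": min_age,
                    "actions": {
                        # Mounts the index from a snapshot in the given repository.
                        "searchable_snapshot": {
                            "snapshot_repository": snapshot_repository
                        }
                    },
                }
            }
        }
    }


# Placeholder values for illustration only.
policy = build_frozen_phase_policy(min_age="90d", snapshot_repository="my-object-store-repo")
print(json.dumps(policy, indent=2))
```

The printed JSON is the kind of body that would be sent to `PUT _ilm/policy/<policy-name>`, merged with whatever earlier phases the policy already defines.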

Recommendation 6: Always Think About The Economics

This is ultimately the lesson that changed my perspective the most.

Most migration discussions focus entirely on whether something is technically possible.

A better question is:

Is it economically worth doing?

If a migration saves 400GB of storage but requires weeks of processing time and temporary infrastructure upgrades, and introduces operational risk, then the decision becomes more complicated.

Engineering decisions should optimize both technical outcomes and operational cost.
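One simple way to put numbers on that question is a break-even estimate: how many months of storage savings it takes to pay back the one-time cost of the migration work. The sketch below is only an illustration. `break_even_months` is a hypothetical helper, and the price per GB-month and the one-time cost are made-up placeholders that you would replace with your own figures; only the 400 GB saving comes from the example above.

```python
def break_even_months(
    saved_gb: float,
    storage_cost_per_gb_month: float,
    one_time_cost: float,
) -> float:
    """Months of storage savings needed to pay back a one-time migration cost."""
    monthly_saving = saved_gb * storage_cost_per_gb_month
    if monthly_saving <= 0:
        raise ValueError("the migration must save a positive amount per month")
    return one_time_cost / monthly_saving


# Placeholder prices: 0.10 per GB-month of storage, 600 of one-time compute
# and engineering cost (same currency unit). Neither is a real measurement.
months = break_even_months(
    saved_gb=400,
    storage_cost_per_gb_month=0.10,
    one_time_cost=600.0,
)
print(f"Break-even after about {months:.0f} months")
```

If the break-even point lies beyond the period you actually need to retain the data, the optimization is hard to justify on storage cost alone.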

TSDS absolutely solves the storage problem.

The challenge is deciding how much time and infrastructure you are willing to spend optimizing historical data.

For me, the best balance was:

  • move future data to TSDS immediately
  • migrate historical data gradually
  • preserve valuable data
  • use Frozen storage aggressively
  • downsample only when the benefit clearly outweighs the cost

That is the strategy I would follow if I had to start this entire migration journey again today.


Conclusion

When I started this journey, I assumed historical TSDS migration would be mostly a configuration exercise.

Create a data stream, configure the lifecycle policy, start the migration, and let Elasticsearch handle the rest.

The reality was very different.

What initially looked like a simple migration project eventually became an exercise in understanding how TSDS actually behaves under production-scale workloads. Time-bound routing, rollover behavior, lifecycle execution, downsampling costs, and operational tradeoffs all became important parts of the solution.

More importantly, this experience taught me that historical migration is fundamentally different from live ingestion.

The strategies that work perfectly for present and future data do not necessarily work for historical data. Once months of telemetry data already exist, migration becomes less about configuration and more about understanding the internal assumptions that TSDS was designed around.

The approach described in this blog is not necessarily the best solution.

It is simply the most reliable solution I found after exploring multiple approaches, testing different designs, and learning from a considerable number of failures along the way.

There may absolutely be better ways to solve this problem.

In fact, if you have faced a similar challenge and discovered a more efficient approach, I would genuinely be interested in hearing about it. One of the reasons I write these blogs is to learn from the community as much as to share my own experiences.

If there is one lesson I would leave you with, it is the same advice I mentioned in the first blog of this series:

If you are planning to move to TSDS, do it as early as possible.

Migrating present and future data is usually straightforward.

Migrating months of historical telemetry after the data has already accumulated is where the real complexity begins.

For me, the final answer was not aggressive downsampling, massive hardware upgrades, or trying to outsmart Elasticsearch.

The answer was understanding the tradeoffs, preserving the data that mattered, and choosing an approach that balanced storage savings, operational cost, and long-term maintainability.

And sometimes, that is what engineering is really about: not finding the perfect solution, but finding the solution that works reliably within the constraints you have.

Thank you for following this three-part TSDS journey. I hope the lessons, failures, and tradeoffs discussed throughout these blogs help make your own migration journey a little easier than mine.


🔗 Connect with Me

📖 Blog by Naresh B. A.

👨‍💻 Building AI & ML Systems | Backend-Focused Full Stack

🌐 Portfolio: Naresh B A

📫 Let's connect on LinkedIn | GitHub: Naresh B A

Thanks for spending your precious time reading this. It's my personal take on a tech topic, and I really appreciate you being here. ❤️
