Migrating nearly 2TB of audit log data to new Elasticsearch (ES) indexes on an AWS OpenSearch cluster is no small feat—especially when the system is live, handling thousands of new logs every minute, and strict consistency is non-negotiable. Here’s how we tackled this challenge, ensuring zero downtime, no data loss, and minimal impact on other services.
Why Migrate?
Our existing ES schema was no longer keeping up with our growth and scalability requirements. Query performance was lagging, and new use cases demanded a more flexible schema. We needed to redesign our ES indexes to support faster retrieval and richer queries, all while keeping response times low.
Challenges
- Live System: The audit log service is mission-critical, with other services depending on it for workflow completion. Downtime was not an option.
- High Throughput: Thousands of logs are ingested every minute. Missing even a single log could mislead customers, as logs track everything from device additions to user logins.
- Shared ES Cluster: Multiple services use the same AWS OpenSearch (Elasticsearch-compatible) cluster, so heavy migration scripts could not be allowed to degrade API latencies.
- Large Data Volume: Each month had its own ES index, with four indexes active at any time. We needed to migrate all data from each to new indexes.
- No Data Loss, No Duplicates: Consistency was paramount. We could not afford to lose or duplicate any audit logs.
Planning the Migration
1. Understanding the Scale
We started by gathering production-scale data: index sizes, average and maximum latencies, average record size, and the CPU, memory, and latency tolerance thresholds. Since we’d be copying (not moving) data, we also ensured we had enough storage to hold the duplicated data during the migration.
To ensure a smooth OpenSearch data migration without overwhelming the cluster, it is important to operate within safe resource limits. Keep CPU utilization of each data node below 70% and maintain JVM memory pressure below 80% throughout the migration process.
When planning the migration indexing rate, also account for existing indexing load, search traffic, and search latency. New indexing throughput should be introduced gradually, based on calculated cluster headroom, assuming a near-linear impact since indexing is both CPU and I/O bound.
The safe migration indexing rate can be estimated using the following model:
- CPU safe threshold (CPU_safe): 70% (matches the target data-node ceiling)
- Current CPU usage (CPU_now): the utilization measured on the cluster before migration
- JVM safe threshold (JVM_safe): 80% (keeps pressure under the GC risk zone)
- Current JVM pressure (JVM_now): the observed JVM memory pressure
- JVM growth factor (JVM_growth): ≈1.2% additional JVM pressure per extra 1,000 docs/sec indexed
On the storage side, although gp3 EBS volumes support 250 MB/sec of throughput per node, indexing consumes a significant share of that disk bandwidth. A conservative estimate is:
40 MB/sec disk throughput per 1,000 docs/sec of indexing load
To maintain additional safety, reserve a 30% buffer to absorb unpredictable load increases and avoid resource saturation (multiply by 0.7).
Safe Migration Indexing Formula
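Putting the variables above together, the model can be written roughly as follows. Note that CPU_growth (the extra CPU percentage per additional 1,000 docs/sec) is an assumed analogue of the JVM growth factor and is something you would measure on your own cluster:

$$
R_{\text{safe}} = 0.7 \times \min\!\left(
\frac{CPU_{\text{safe}} - CPU_{\text{now}}}{CPU_{\text{growth}}},\;
\frac{JVM_{\text{safe}} - JVM_{\text{now}}}{JVM_{\text{growth}}},\;
\frac{Disk_{\text{available}}}{40\ \text{MB/s}}
\right) \times 1000\ \text{docs/sec}
$$

Here Disk_available is the unused portion of the 250 MB/sec gp3 budget, and the 0.7 factor applies the 30% safety buffer.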
Practical Example
If the current cluster handles 20,000 records per minute (RPM) and both CPU and JVM show 50% available headroom, you can safely plan to migrate an additional 10,000 records per minute without overloading the system (effective rate ≈30,000 RPM, ≈43 million records per day). This keeps:
- CPU under 55–60%
- JVM pressure below 80–82%
- Disk throughput well within safe limits
This approach ensures stable search performance and a reliable migration process without compromising cluster health.
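As a minimal sketch, the headroom math can be scripted. The growth factors and cluster readings below are illustrative assumptions (measure your own); the second line just checks the daily-volume arithmetic from the example above.

```python
def safe_extra_rpm(cpu_now, jvm_now, disk_used_mbps,
                   cpu_safe=70.0, jvm_safe=80.0,
                   cpu_growth=1.5,    # assumed: extra CPU % per 1,000 docs/sec (measure on your cluster)
                   jvm_growth=1.2,    # extra JVM pressure % per 1,000 docs/sec
                   disk_per_1k=40.0,  # MB/sec of disk throughput per 1,000 docs/sec
                   disk_max_mbps=250.0, buffer=0.7):
    """Additional indexing rate (records per minute) the cluster can likely absorb."""
    r_cpu = (cpu_safe - cpu_now) / cpu_growth * 1000          # docs/sec allowed by CPU headroom
    r_jvm = (jvm_safe - jvm_now) / jvm_growth * 1000          # docs/sec allowed by JVM headroom
    r_disk = (disk_max_mbps - disk_used_mbps) / disk_per_1k * 1000
    return max(0.0, buffer * min(r_cpu, r_jvm, r_disk)) * 60  # convert docs/sec -> records/minute

# Illustrative readings; the model leaves plenty of room above the conservative
# extra 10,000 records/minute chosen in the example above.
print(f"ceiling ≈ {safe_extra_rpm(cpu_now=35, jvm_now=40, disk_used_mbps=60):,.0f} records/min")
print(f"planned  = {(20_000 + 10_000) * 60 * 24:,} records/day")  # ≈ 43.2M/day at 30,000 RPM
```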
2. Efficient Data Reading
Reading data from ES indexes had to be done serially. Parallel reads would add too much query overhead. We considered migrating customer-by-customer, but with 400k+ accounts and the need to fetch account IDs from another microservice, this approach was too complex and would require excessive checkpoint management.
Instead, we used the ES scroll API (search_after with a point-in-time works similarly), which let us read data in serial batches using a cursor value. This cursor was stored in Redis as the checkpoint for each month’s index.
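A minimal sketch of the serial read loop, shown with search_after because its sort values make a naturally resumable cursor. The cluster endpoint, index name, Redis key, and sort fields are placeholders, not our actual ones, and the opensearch-py and redis clients are assumed.

```python
import json
import redis
from opensearchpy import OpenSearch

es = OpenSearch(hosts=["https://search-endpoint:443"])   # placeholder cluster endpoint
rdb = redis.Redis(host="redis-host", port=6379)          # placeholder Redis host

INDEX = "audit-logs-2024-05"                             # one month's index (placeholder name)
CHECKPOINT_KEY = f"migration:cursor:{INDEX}"

def read_batches(batch_size=10_000):
    """Read the index serially in 10k-record pages; the last hit's sort values act as the cursor."""
    saved = rdb.get(CHECKPOINT_KEY)
    cursor = json.loads(saved) if saved else None
    while True:
        body = {
            "size": batch_size,
            "query": {"match_all": {}},
            "sort": [{"created_at": "asc"}, {"log_id": "asc"}],  # assumed sortable fields
        }
        if cursor:
            body["search_after"] = cursor
        hits = es.search(index=INDEX, body=body)["hits"]["hits"]
        if not hits:
            break
        cursor = hits[-1]["sort"]
        yield hits, cursor
```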
3. Designing for Reliability
Given the massive data volume, record count, and resource constraints, the migration would take time. We needed a process that could be paused or stopped at any point, and resumed from the last checkpoint in case of failure—without starting over. For this, we used Redis to store migration checkpoints, with a 7-day TTL for safety.
Checkpointing Strategy: When to Save the Checkpoint in Redis?
Instead of writing the ES cursor to Redis after every read from ES, we wrote it at 5-minute intervals. This kept the write load on Redis low; since our Redis cluster is also shared infrastructure, we wanted to avoid spikes or unnecessary load during the migration. A restart could replay up to ~50k docs (5 minutes × 10k/min), which was acceptable because writes were idempotent (same document IDs) and bulk replays were safe for downstream services.
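Continuing the read-loop sketch above (same rdb client and CHECKPOINT_KEY), the interval-based checkpointing looks roughly like this:

```python
import json
import time

CHECKPOINT_INTERVAL = 300        # seconds: write the cursor to Redis at most every 5 minutes
CHECKPOINT_TTL = 7 * 24 * 3600   # 7-day TTL, matching the safety window

def run_migration(process_batch):
    """process_batch: callable that transforms and bulk-writes one 10k-record fetch."""
    last_saved = 0.0
    for hits, cursor in read_batches():
        process_batch(hits)
        if time.monotonic() - last_saved >= CHECKPOINT_INTERVAL:
            rdb.set(CHECKPOINT_KEY, json.dumps(cursor), ex=CHECKPOINT_TTL)
            last_saved = time.monotonic()
```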
4. Deployment Strategy
The migration script was containerized and deployed as a Kubernetes Job. On startup, it checked Redis for an existing checkpoint. If none was found, it started a fresh migration. Depending on system load, we spawned multiple job instances, each handling a different month’s index. The script was designed to take the month as a parameter, allowing multiple instances to run concurrently without conflict.
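A rough sketch of the job’s startup logic, continuing the same sketch. The --month flag and index naming convention are assumptions for illustration.

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description="Audit-log index migration job")
    parser.add_argument("--month", required=True, help="e.g. 2024-05; selects the source index")
    args = parser.parse_args()

    index = f"audit-logs-{args.month}"              # placeholder naming convention
    checkpoint_key = f"migration:cursor:{index}"
    if rdb.get(checkpoint_key):
        print(f"Checkpoint found for {index}; resuming migration")
    else:
        print(f"No checkpoint for {index}; starting a fresh migration")
    # run_migration(...) for this index

if __name__ == "__main__":
    main()
```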
5. Batch Processing and Parallel Writes
Each ES call fetched 10,000 records. We split these into ten batches of 1,000 records each, processing them in parallel. Each batch was converted to the new schema, sometimes requiring calls to other internal services for additional data. Once processed, each batch was written back to ES using bulk writes.
We could have increased parallelism, but since each record conversion involved external service calls, we prioritized system stability over speed. The migration was intentionally slow to avoid impacting customers or other services. Bulk writers honored backpressure—if CPU/JVM/disk exceeded thresholds or bulk responses returned rejections, the job applied exponential backoff before resuming.
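A sketch of this fan-out, assuming the opensearch-py bulk helper and a thread pool; the target index name and the transform details are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from opensearchpy import helpers

NEW_INDEX = "audit-logs-v2-2024-05"   # placeholder target index

def transform(hit):
    """Convert one document to the new schema (may call other internal services)."""
    doc = dict(hit["_source"])
    # ... reshape / enrich fields here ...
    return {"_index": NEW_INDEX, "_id": hit["_id"], "_source": doc}

def write_batch(batch, max_retries=5):
    """Bulk-write one 1,000-record batch, backing off exponentially on failures/rejections."""
    actions = [transform(h) for h in batch]
    for attempt in range(max_retries):
        try:
            helpers.bulk(es, actions)
            return
        except Exception:
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("bulk writes kept failing; stop and resume later from the checkpoint")

def process(hits):
    """Split a 10,000-record fetch into ten 1,000-record batches and write them in parallel."""
    batches = [hits[i:i + 1_000] for i in range(0, len(hits), 1_000)]
    with ThreadPoolExecutor(max_workers=10) as pool:
        list(pool.map(write_batch, batches))
```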
6. Handling Duplicates and Consistency
A key risk was partial batch processing: if the migration was interrupted after fetching 10,000 records but before all were written, restarting from the last checkpoint could result in duplicate writes. To prevent this, we used the original ES document ID as the ID in the new index. Elasticsearch overwrites documents with the same ID, ensuring no duplicates.
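In client terms (reusing the es client and NEW_INDEX from the sketches above, with placeholder values), a replayed write is simply an overwrite:

```python
new_doc = {"event": "user_login"}                    # transformed document (placeholder)
doc_id = "a1b2c3d4"                                  # original ES document ID, reused as-is
es.index(index=NEW_INDEX, id=doc_id, body=new_doc)   # first write creates the document
es.index(index=NEW_INDEX, id=doc_id, body=new_doc)   # replay after a restart overwrites it, no duplicate
```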
Results
- 600 Million Logs Migrated: Over four months, we processed about 600 million audit logs.
- 20 Million Records/Day: On average, 20 million records were migrated daily.
- Zero Downtime, No Data Loss: The migration was seamless, with no impact on customers or dependent services.
Only once all the data had been copied into the new-schema indexes did we enable the system to read from them. This ensured existing system features were unaffected during the migration.
Because data flows continuously into the current month’s index, another module wrote each new log to the new index alongside the existing old index. Older months’ indexes were not a concern, since data cannot be added to them: the system is append-only, with no edits or deletes allowed.
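The dual-write for the current month, sketched with placeholder index names and a hypothetical to_new_schema helper:

```python
def write_audit_log(doc_id, doc):
    """Every incoming log is written to both schemas until the cutover completes."""
    es.index(index="audit-logs-2024-08", id=doc_id, body=doc)                    # existing current-month index
    es.index(index="audit-logs-v2-2024-08", id=doc_id, body=to_new_schema(doc))  # new-schema index
```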
Key Takeaways
- Checkpoints are Critical: Storing ES cursor values in Redis allowed safe, resumable migrations.
- Batching and Parallelism: Careful batching and controlled parallelism balanced speed and system stability.
- Idempotency: Using document IDs to prevent duplicates was essential for data integrity.
- Preparation Pays Off: Understanding production data and system constraints upfront made all the difference.


