DEV Community

Imran Siddique

Originally published at Medium

The Hardest Migration: Moving Petabytes of Data While It Changes

How to perform a schema change on a live, massive-scale database.

Updating the OS is easy. Updating the Data is hard.

In a stateful system, eventually, you will hit a wall. You need to change the schema, the sharding strategy, or the underlying storage format. You cannot just “restart” the nodes because the data itself is incompatible. You have to rewrite petabytes of history.

And you have to do it while users are still writing new data.

This is the strategy we used to migrate massive accounts in Azure DevOps with zero downtime.

1. The “Side-by-Side” Infrastructure

We didn’t try to migrate in place. That is a recipe for corruption. We adopted a Primary/Primary mindset. We spun up a brand-new infrastructure (V2) alongside the old one (V1). For a period of time, both existed simultaneously.

2. The Automated “Account Mover”

You cannot migrate petabytes at once. We built an automated engine to migrate one Account at a time.

  1. The Snapshot: The system picks an account and begins re-indexing its data from the Source of Truth into the new V2 cluster.
  2. The Gap (The Hard Part): Re-indexing a large account might take 2 hours. If the user commits new code during those 2 hours, V2 is stale by the time the snapshot finishes.
  3. The Catch-Up: Before we flip the switch, the system enters a “Catch-Up” phase. It looks at the delta — the events that happened during the migration — and replays them onto V2.
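
The three phases above can be sketched in a few lines. This is a minimal illustration, not the actual Account Mover: it assumes the Source of Truth is an append-only event log with increasing sequence numbers, and every name here (`SourceOfTruth`, `V2Index`, `snapshot`, `catch_up`) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Event:
    seq: int    # assumed: strictly increasing sequence number per account
    key: str
    value: str


@dataclass
class SourceOfTruth:
    """Hypothetical stand-in for the Source of Truth: an append-only log."""
    log: list[Event] = field(default_factory=list)

    def append(self, key: str, value: str) -> None:
        self.log.append(Event(len(self.log) + 1, key, value))

    def high_water_mark(self) -> int:
        return len(self.log)

    def events_after(self, seq: int) -> list[Event]:
        return [e for e in self.log if e.seq > seq]


@dataclass
class V2Index:
    """Hypothetical V2 cluster: remembers the last event it applied."""
    docs: dict[str, str] = field(default_factory=dict)
    applied_seq: int = 0

    def apply(self, event: Event) -> None:
        self.docs[event.key] = event.value
        self.applied_seq = event.seq


def snapshot(source: SourceOfTruth, target: V2Index) -> int:
    """Re-index everything that existed when the snapshot started."""
    mark = source.high_water_mark()
    for event in source.log[:mark]:
        target.apply(event)
    return mark


def catch_up(source: SourceOfTruth, target: V2Index, max_rounds: int = 10) -> bool:
    """Replay the delta until a round finds nothing new; False if it never drains."""
    for _ in range(max_rounds):
        delta = source.events_after(target.applied_seq)
        if not delta:
            return True
        for event in delta:
            target.apply(event)
    return False


# Usage: two writes land during "the gap" and are replayed by the catch-up.
source = SourceOfTruth()
source.append("file_a", "v1")
v2 = V2Index()
snapshot(source, v2)
source.append("file_a", "v2")
source.append("file_b", "v1")
synced = catch_up(source, v2)
```

Tracking the last applied sequence number on the V2 side is what makes the replay safe to repeat: each round asks only for events newer than what V2 already holds.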

3. The Atomic Switch

Once V2 is perfectly synced with V1, the automation flips the pointer.

  • Before: User reads from V1.
  • After: User reads from V2.

The user never knows it happened. They just see that their search results are suddenly faster or better.
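
One way to picture "flipping the pointer" is a per-account routing table updated with a compare-and-swap. The sketch below is an assumption about how such a pointer could look, not a description of the Azure DevOps implementation; `Router`, `cluster_for`, and `flip` are hypothetical names.

```python
import threading


class Router:
    """Hypothetical routing table: which cluster serves reads for each account."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._routes: dict[str, str] = {}

    def cluster_for(self, account: str) -> str:
        # Accounts that have not been migrated default to V1.
        with self._lock:
            return self._routes.get(account, "V1")

    def flip(self, account: str, expected: str, new: str) -> bool:
        """Switch the account only if it still points at `expected`."""
        with self._lock:
            if self._routes.get(account, "V1") != expected:
                return False
            self._routes[account] = new
            return True


# Usage: the flip is per account, so other accounts keep reading from V1.
router = Router()
before = router.cluster_for("contoso")       # "V1"
flipped = router.flip("contoso", "V1", "V2")
after = router.cluster_for("contoso")        # "V2"
```

Because every read resolves the cluster under the same lock that guards the flip, a reader sees either V1 or V2 for an account, never a half-switched state. The same `flip` call with the arguments reversed moves the account back.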

4. The Safety Net

We never delete V1 immediately. Once the switch happens, V1 sits there, cold but available. If V2 fails catastrophically, we can flip the switch back. Only after a “cooling off” period do we finally decommission the old infrastructure.
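
The safety net boils down to two rules: rollback is allowed only while V1 still exists, and V1 can be removed only after the cooling-off period has passed. Here is a rough sketch of those rules. The 14-day `COOLING_OFF` value is an invented placeholder (the actual duration is not stated here), and `MigratedAccount`, `roll_back`, and `decommission_v1` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Placeholder value for illustration only; the real period is not given.
COOLING_OFF = timedelta(days=14)


@dataclass
class MigratedAccount:
    name: str
    active: str              # "V1" or "V2"
    switched_at: datetime    # when the pointer was flipped to V2
    v1_retained: bool = True


def roll_back(account: MigratedAccount) -> None:
    """Flip the account back to V1, which is only possible while V1 exists."""
    if not account.v1_retained:
        raise RuntimeError("V1 already decommissioned; rollback is not possible")
    account.active = "V1"


def decommission_v1(account: MigratedAccount, now: datetime) -> bool:
    """Drop V1 only if the account is on V2 and the cooling-off period is over."""
    if account.active != "V2":
        return False
    if now - account.switched_at < COOLING_OFF:
        return False
    account.v1_retained = False
    return True
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the rule easy to test.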

The Lesson: Migrations at scale are not manual tasks. They are software products. If you have to migrate more than 10 accounts, build a robot to do it for you.
