You've assessed your environment. You know why you're modernizing. You understand the trade-offs. Now comes the question everyone dreads:
How do we actually move this thing?
And someone in the room always suggests: "Let's just migrate everything over a weekend."
No. Just... no.
Why "big bang" migrations fail
Big bang means you flip a switch, move everything at once, and hope it works. It's tempting because it sounds simple.
It's not simple. It's a disaster waiting to happen.
Here's what goes wrong:
You can't test everything. You think you can, but you can't. There's always something you missed.
You can't roll back easily. Once you've moved 50 services, rolling back means moving 50 services back. Good luck doing that at 2 AM.
You can't learn from mistakes. If something breaks, you're fixing it under pressure with everyone watching.
You put all your eggs in one basket. One mistake and the entire business is down.
The only time big bang works: When you're migrating something so simple it doesn't matter. A static website. A single service with no dependencies. That's it.
For everything else? Phase it.
What phased migration actually means
You break the migration into small, manageable chunks. You move one piece at a time. You validate it works. Then you move the next piece.
The benefits:
You learn as you go. The first migration teaches you what works and what doesn't.
You can roll back easily. If something breaks, you only roll back one piece, not everything.
You reduce risk. A small failure is manageable. A total failure is career-ending.
You show progress. Leadership sees results every few weeks, not "we'll be done in 6 months, trust us."
How to design your phases
Step 1: Map your dependencies
You can't migrate things in random order. You need to know what depends on what.
Draw it out:
- What talks to what?
- What can run independently?
- What's blocking other things?
Example:
- Users → Load Balancer → App Servers → Database → S3
- Background Jobs → Queue → Workers → Database
You can't migrate the database before the app servers. You can't migrate the app servers before you have a load balancer.
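The dependency map above is exactly a graph-ordering problem, so you can let code check your migration order for you. A minimal sketch using Python's standard-library `graphlib` — the service names and dependency edges are hypothetical stand-ins for your own map:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be
# migrated before it (its predecessors in the migration order).
deps = {
    "load_balancer": set(),
    "queue": set(),
    "app_servers": {"load_balancer"},
    "workers": {"queue"},
    "database": {"app_servers", "workers"},
}

# static_order() yields a migration order that respects every edge:
# the load balancer and queue come first, the database comes last.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

If your map has a cycle (two services that each depend on the other), `static_order()` raises `CycleError` — which is useful to learn during planning, not at 2 AM.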
Step 2: Identify your migration candidates
Not everything is equally easy to migrate. Some things are simple. Some are nightmares.
Easy candidates (start here):
- Stateless services (no database, no persistent storage)
- Low-traffic services (if it breaks, not many people notice)
- Non-critical services (internal tools, admin panels)
- Services with good test coverage
Hard candidates (save for later):
- Stateful services (databases, file storage)
- High-traffic services (your main API, payment processing)
- Critical services (if it's down, the business is down)
- Services with no tests (you don't know if it works until it doesn't)
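Those easy/hard criteria can be turned into a crude ranking so the order of candidates falls out automatically. A sketch with made-up services and a deliberately simple scoring rule (one point per risk factor) — tune the weights to your own environment:

```python
# Hypothetical service inventory with the four risk factors from the text.
services = [
    {"name": "admin-panel", "stateful": False, "high_traffic": False,
     "critical": False, "has_tests": True},
    {"name": "payments-api", "stateful": True, "high_traffic": True,
     "critical": True, "has_tests": True},
    {"name": "image-resizer", "stateful": False, "high_traffic": False,
     "critical": False, "has_tests": False},
]

def risk_score(svc):
    # One point for each property that makes migration harder.
    return (svc["stateful"] + svc["high_traffic"]
            + svc["critical"] + (not svc["has_tests"]))

# Migrate in ascending risk order: easy candidates first.
for svc in sorted(services, key=risk_score):
    print(svc["name"], risk_score(svc))
```

Here the admin panel (score 0) is a Wave 1 pilot, the untested image resizer (score 1) goes mid-plan, and the payments API (score 3) waits for the late waves.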
Step 3: Group into waves
A wave is a batch of services you migrate together. Usually 2-4 weeks per wave.
Wave 1: The pilot
- Pick 1-2 low-risk services
- Migrate them fully
- Validate everything works
- Document what you learned
Goal: Prove the process works. Build confidence. Find the problems when the stakes are low.
Waves 2-3: Low-hanging fruit
- Migrate the easy stuff
- Build momentum
- Refine your process
- Train the team
Goal: Show progress. Get quick wins. Make leadership happy.
Waves 4-6: The meat of it
- Migrate the core services
- This is where the real work happens
- Move carefully, test thoroughly
Goal: Get the important stuff done without breaking anything.
Waves 7+: The hard stuff
- Databases, stateful services, critical systems
- By now you know what you're doing
- You've learned from earlier mistakes
Goal: Finish strong. Migrate the scary stuff with confidence.
Step 4: Plan your rollback strategy
Before you migrate anything, know how to undo it.
For each wave, document:
- How to roll back (step by step)
- How long rollback takes
- What data might be lost
- Who makes the call to roll back
Example rollback plan:
- Switch DNS back to old load balancer (5 minutes)
- Verify old system is receiving traffic (2 minutes)
- Stop new services to avoid confusion (1 minute)
- Sync any data created during migration (30 minutes)
If you can't roll back in under an hour, your phase is too big.
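The rollback plan above is easy to encode as a runbook with time estimates, which makes the under-an-hour rule checkable before you start. A sketch using the example steps from the text:

```python
# Rollback runbook: (step, estimated minutes), from the example plan above.
rollback_steps = [
    ("Switch DNS back to old load balancer", 5),
    ("Verify old system is receiving traffic", 2),
    ("Stop new services to avoid confusion", 1),
    ("Sync data created during migration", 30),
]

total = sum(minutes for _, minutes in rollback_steps)
print(f"Estimated rollback time: {total} minutes")

# Golden rule: if rollback exceeds an hour, the phase is too big.
assert total <= 60, "Phase too big: split it into smaller waves"
```

If the assertion fires during planning, you split the wave. Much cheaper than discovering the same thing in production.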
Migration strategies by service type
Stateless services (APIs, web apps)
Strategy: Blue/green deployment
- Deploy new version alongside old version
- Route 10% of traffic to new version
- Monitor for errors
- Gradually increase to 50%, then 100%
- Shut down old version
Why it works: You can roll back instantly by routing traffic back.
Databases
Strategy: Parallel run
- Set up new database
- Replicate data from old to new (continuously)
- Run both databases in parallel
- Switch reads to new database (writes still go to old)
- Monitor for issues
- Switch writes to new database
- Keep old database running for a week
- Shut down old database
Why it works: You're never fully committed until you're sure it works.
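While both databases run in parallel, you can verify the new one on every read before trusting it with writes. A toy sketch of that dual-read check, with plain dicts standing in for the two databases:

```python
# Dicts stand in for the old and new databases during the parallel run.
old_db = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}
new_db = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}

def verified_read(key):
    """Read from the new DB, but compare against the old DB and log drift."""
    old_val, new_val = old_db.get(key), new_db.get(key)
    if old_val != new_val:
        # Replication lag or a migration bug: fall back to the old value.
        print(f"MISMATCH on {key}: old={old_val!r} new={new_val!r}")
        return old_val
    return new_val

print(verified_read("user:2"))
```

A week of zero mismatches is the evidence you want before switching writes; any mismatch is a bug you found without a single user noticing.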
Background jobs
Strategy: Gradual migration
- Deploy new workers alongside old workers
- Route new jobs to new workers
- Let old workers finish existing jobs
- Shut down old workers when queue is empty
Why it works: No jobs are lost, no downtime.
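The drain-then-shutdown pattern is simple enough to sketch with two in-memory queues — `deque`s stand in for your real message broker, and the names are illustrative:

```python
from collections import deque

old_queue = deque(["job-1", "job-2"])   # jobs enqueued before the cutover
new_queue = deque()

def enqueue(job):
    # After the cutover, every new job goes to the new workers' queue.
    new_queue.append(job)

def drain(queue, worker_name):
    while queue:
        job = queue.popleft()
        print(f"{worker_name} finished {job}")

enqueue("job-3")
drain(old_queue, "old-worker")   # old workers finish what they started
assert not old_queue             # queue empty: safe to shut old workers down
drain(new_queue, "new-worker")
```

The shutdown condition is the whole trick: old workers stop only when their queue hits zero, so no job is ever orphaned mid-flight.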
File storage
Strategy: Lazy migration
- Set up new storage (S3, EFS, whatever)
- Update app to check both locations
- Write new files to new storage
- Keep old files in old storage
- Migrate old files in batches (off-peak hours)
- Eventually everything's in new storage
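The read path is where the "lazy" part lives: check the new location first, fall back to the old one, and copy the file over on first access. A toy sketch with dicts standing in for the two storage backends:

```python
old_storage = {"legacy.jpg": b"old-bytes"}
new_storage = {}

def write(name, data):
    # All new files land in the new storage.
    new_storage[name] = data

def read(name):
    # Check new storage first; fall back to old and migrate on access.
    if name in new_storage:
        return new_storage[name]
    data = old_storage[name]
    new_storage[name] = data     # lazy copy: the file moves on first read
    return data

write("fresh.png", b"new-bytes")
print(read("legacy.jpg"))        # served from old, now copied to new
assert "legacy.jpg" in new_storage
```

Hot files migrate themselves through normal traffic; the batch job only has to sweep up whatever nobody has touched.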
Real-world example: E-commerce site migration
The setup:
- Monolithic app on EC2
- MySQL database
- Redis cache
- S3 for images
- 10,000 daily users
Wave 1: Static assets (Week 1-2)
- Migrate images to CloudFront + S3
- Update app to use new URLs
- Validate images load correctly
- Risk: Low. Rollback: Easy.
Wave 2: Redis cache (Week 3-4)
- Set up ElastiCache
- Run both caches in parallel
- Switch to ElastiCache
- Monitor for issues
- Risk: Low. Rollback: Easy.
Wave 3: App servers (Week 5-8)
- Containerize app (ECS)
- Deploy behind new ALB
- Route 10% traffic to new app
- Gradually increase to 100%
- Risk: Medium. Rollback: Moderate.
Wave 4: Database (Week 9-12)
- Set up RDS with replication from old DB
- Switch reads to RDS
- Monitor performance
- Switch writes to RDS
- Keep old DB running for 2 weeks
- Risk: High. Rollback: Hard but possible.
Wave 5: Cleanup (Week 13-14)
- Shut down old infrastructure
- Update documentation
- Celebrate
Total time: 14 weeks. Zero downtime. No disasters.
Common mistakes
Making phases too big. If a phase takes more than 4 weeks, it's too big. Break it down.
Not testing rollback. You don't want to figure out how to roll back at 3 AM when production is down.
Skipping the pilot. The first migration teaches you everything. Don't skip it.
Moving too fast. Speed is good, but not at the expense of safety. Give each phase time to stabilize.
Not documenting lessons learned. What you learn in Wave 1 should improve Wave 2. Write it down.
Ignoring dependencies. You can't migrate the app before the database. Map it out first.
What you should have for each phase
Before the phase:
- List of what's being migrated
- Dependencies mapped out
- Rollback plan documented
- Success criteria defined
- Team knows their roles
During the phase:
- Monitoring and alerts set up
- Communication plan (who to notify, when)
- Go/no-go decision points
- Rollback trigger conditions
After the phase:
- Validation that everything works
- Performance comparison (old vs. new)
- Lessons learned documented
- Cleanup tasks completed
The golden rule
If you can't roll back in under an hour, your phase is too big.
Break it down. Make it smaller. Reduce the risk.