You've assessed your environment. You know why you're modernizing. You understand the trade-offs. Now comes the question everyone dreads:
How do we actually move this thing?
And someone in the room always suggests: "Let's just migrate everything over a weekend."
No. Just... no.
Why "big bang" migrations fail
Big bang means you flip a switch, move everything at once, and hope it works. It's tempting because it sounds simple.
It's not simple. It's a disaster waiting to happen.
Here's what goes wrong:
You can't test everything. You think you can, but you can't. There's always something you missed.
You can't roll back easily. Once you've moved 50 services, rolling back means moving 50 services back. Good luck doing that at 2 AM.
You can't learn from mistakes. If something breaks, you're fixing it under pressure with everyone watching.
You put all your eggs in one basket. One mistake and the entire business is down.
The only time big bang works: When you're migrating something so simple it doesn't matter. A static website. A single service with no dependencies. That's it.
For everything else? Phase it.
What phased migration actually means
You break the migration into small, manageable chunks. You move one piece at a time. You validate it works. Then you move the next piece.
The benefits:
You learn as you go. The first migration teaches you what works and what doesn't.
You can roll back easily. If something breaks, you only roll back one piece, not everything.
You reduce risk. A small failure is manageable. A total failure is career-ending.
You show progress. Leadership sees results every few weeks, not "we'll be done in 6 months, trust us."
How to design your phases
Step 1: Map your dependencies
You can't migrate things in random order. You need to know what depends on what.
Draw it out:
- What talks to what?
- What can run independently?
- What's blocking other things?
Example:
- Users → Load Balancer → App Servers → Database → S3
- Background Jobs → Queue → Workers → Database
You can't migrate the database before the app servers. You can't migrate the app servers before you have a load balancer.
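The dependency map above is exactly a graph-ordering problem, so you can let code check your migration order for you. A minimal sketch using Python's standard-library `graphlib` — the service names and dependency edges are hypothetical stand-ins for your own map:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be
# migrated before it (its predecessors in the migration order).
deps = {
    "load_balancer": set(),
    "queue": set(),
    "app_servers": {"load_balancer"},
    "workers": {"queue"},
    "database": {"app_servers", "workers"},
}

# static_order() yields a migration order that respects every edge:
# the load balancer and queue come first, the database comes last.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

If your map has a cycle (two services that each depend on the other), `static_order()` raises `CycleError` — which is useful to learn during planning, not at 2 AM.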
Step 2: Identify your migration candidates
Not everything is equally easy to migrate. Some things are simple. Some are nightmares.
Easy candidates (start here):
- Stateless services (no database, no persistent storage)
- Low-traffic services (if it breaks, not many people notice)
- Non-critical services (internal tools, admin panels)
- Services with good test coverage
Hard candidates (save for later):
- Stateful services (databases, file storage)
- High-traffic services (your main API, payment processing)
- Critical services (if it's down, the business is down)
- Services with no tests (you don't know if it works until it doesn't)
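Those easy/hard criteria can be turned into a crude ranking so the order of candidates falls out automatically. A sketch with made-up services and a deliberately simple scoring rule (one point per risk factor) — tune the weights to your own environment:

```python
# Hypothetical service inventory with the four risk factors from the text.
services = [
    {"name": "admin-panel", "stateful": False, "high_traffic": False,
     "critical": False, "has_tests": True},
    {"name": "payments-api", "stateful": True, "high_traffic": True,
     "critical": True, "has_tests": True},
    {"name": "image-resizer", "stateful": False, "high_traffic": False,
     "critical": False, "has_tests": False},
]

def risk_score(svc):
    # One point for each property that makes migration harder.
    return (svc["stateful"] + svc["high_traffic"]
            + svc["critical"] + (not svc["has_tests"]))

# Migrate in ascending risk order: easy candidates first.
for svc in sorted(services, key=risk_score):
    print(svc["name"], risk_score(svc))
```

Here the admin panel (score 0) is a Wave 1 pilot, the untested image resizer (score 1) goes mid-plan, and the payments API (score 3) waits for the late waves.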
Step 3: Group into waves
A wave is a batch of services you migrate together. Usually 2-4 weeks per wave.
Wave 1: The pilot
- Pick 1-2 low-risk services
- Migrate them fully
- Validate everything works
- Document what you learned
Goal: Prove the process works. Build confidence. Find the problems when the stakes are low.
Waves 2-3: Low-hanging fruit
- Migrate the easy stuff
- Build momentum
- Refine your process
- Train the team
Goal: Show progress. Get quick wins. Make leadership happy.
Waves 4-6: The meat of it
- Migrate the core services
- This is where the real work happens
- Move carefully, test thoroughly
Goal: Get the important stuff done without breaking anything.
Waves 7+: The hard stuff
- Databases, stateful services, critical systems
- By now you know what you're doing
- You've learned from earlier mistakes
Goal: Finish strong. Migrate the scary stuff with confidence.
Step 4: Plan your rollback strategy
Before you migrate anything, know how to undo it.
For each wave, document:
- How to roll back (step by step)
- How long rollback takes
- What data might be lost
- Who makes the call to roll back
Example rollback plan:
- Switch DNS back to old load balancer (5 minutes)
- Verify old system is receiving traffic (2 minutes)
- Stop new services to avoid confusion (1 minute)
- Sync any data created during migration (30 minutes)
If you can't roll back in under an hour, your phase is too big.
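The rollback plan above is easy to encode as a runbook with time estimates, which makes the under-an-hour rule checkable before you start. A sketch using the example steps from the text:

```python
# Rollback runbook: (step, estimated minutes), from the example plan above.
rollback_steps = [
    ("Switch DNS back to old load balancer", 5),
    ("Verify old system is receiving traffic", 2),
    ("Stop new services to avoid confusion", 1),
    ("Sync data created during migration", 30),
]

total = sum(minutes for _, minutes in rollback_steps)
print(f"Estimated rollback time: {total} minutes")

# Golden rule: if rollback exceeds an hour, the phase is too big.
assert total <= 60, "Phase too big: split it into smaller waves"
```

If the assertion fires during planning, you split the wave. Much cheaper than discovering the same thing in production.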
Migration strategies by service type
Stateless services (APIs, web apps)
Strategy: Blue/green deployment
- Deploy new version alongside old version
- Route 10% of traffic to new version
- Monitor for errors
- Gradually increase to 50%, then 100%
- Shut down old version
Why it works: You can roll back instantly by routing traffic back.
Databases
Strategy: Parallel run
- Set up new database
- Replicate data from old to new (continuously)
- Run both databases in parallel
- Switch reads to new database (writes still go to old)
- Monitor for issues
- Switch writes to new database
- Keep old database running for a week
- Shut down old database
Why it works: You're never fully committed until you're sure it works.
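While both databases run in parallel, you can verify the new one on every read before trusting it with writes. A toy sketch of that dual-read check, with plain dicts standing in for the two databases:

```python
# Dicts stand in for the old and new databases during the parallel run.
old_db = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}
new_db = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}

def verified_read(key):
    """Read from the new DB, but compare against the old DB and log drift."""
    old_val, new_val = old_db.get(key), new_db.get(key)
    if old_val != new_val:
        # Replication lag or a migration bug: fall back to the old value.
        print(f"MISMATCH on {key}: old={old_val!r} new={new_val!r}")
        return old_val
    return new_val

print(verified_read("user:2"))
```

A week of zero mismatches is the evidence you want before switching writes; any mismatch is a bug you found without a single user noticing.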
Background jobs
Strategy: Gradual migration
- Deploy new workers alongside old workers
- Route new jobs to new workers
- Let old workers finish existing jobs
- Shut down old workers when queue is empty
Why it works: No jobs are lost, no downtime.
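The drain-then-shutdown pattern is simple enough to sketch with two in-memory queues — `deque`s stand in for your real message broker, and the names are illustrative:

```python
from collections import deque

old_queue = deque(["job-1", "job-2"])   # jobs enqueued before the cutover
new_queue = deque()

def enqueue(job):
    # After the cutover, every new job goes to the new workers' queue.
    new_queue.append(job)

def drain(queue, worker_name):
    while queue:
        job = queue.popleft()
        print(f"{worker_name} finished {job}")

enqueue("job-3")
drain(old_queue, "old-worker")   # old workers finish what they started
assert not old_queue             # queue empty: safe to shut old workers down
drain(new_queue, "new-worker")
```

The shutdown condition is the whole trick: old workers stop only when their queue hits zero, so no job is ever orphaned mid-flight.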
File storage
Strategy: Lazy migration
- Set up new storage (S3, EFS, whatever)
- Update app to check both locations
- Write new files to new storage
- Keep old files in old storage
- Migrate old files in batches (off-peak hours)
- Eventually everything's in new storage
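The read path is where the "lazy" part lives: check the new location first, fall back to the old one, and copy the file over on first access. A toy sketch with dicts standing in for the two storage backends:

```python
old_storage = {"legacy.jpg": b"old-bytes"}
new_storage = {}

def write(name, data):
    # All new files land in the new storage.
    new_storage[name] = data

def read(name):
    # Check new storage first; fall back to old and migrate on access.
    if name in new_storage:
        return new_storage[name]
    data = old_storage[name]
    new_storage[name] = data     # lazy copy: the file moves on first read
    return data

write("fresh.png", b"new-bytes")
print(read("legacy.jpg"))        # served from old, now copied to new
assert "legacy.jpg" in new_storage
```

Hot files migrate themselves through normal traffic; the batch job only has to sweep up whatever nobody has touched.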
Real-world example: E-commerce site migration
The setup:
- Monolithic app on EC2
- MySQL database
- Redis cache
- S3 for images
- 10,000 daily users
Wave 1: Static assets (Week 1-2)
- Migrate images to CloudFront + S3
- Update app to use new URLs
- Validate images load correctly
- Risk: Low. Rollback: Easy.
Wave 2: Redis cache (Week 3-4)
- Set up ElastiCache
- Run both caches in parallel
- Switch to ElastiCache
- Monitor for issues
- Risk: Low. Rollback: Easy.
Wave 3: App servers (Week 5-8)
- Containerize app (ECS)
- Deploy behind new ALB
- Route 10% traffic to new app
- Gradually increase to 100%
- Risk: Medium. Rollback: Moderate.
Wave 4: Database (Week 9-12)
- Set up RDS with replication from old DB
- Switch reads to RDS
- Monitor performance
- Switch writes to RDS
- Keep old DB running for 2 weeks
- Risk: High. Rollback: Hard but possible.
Wave 5: Cleanup (Week 13-14)
- Shut down old infrastructure
- Update documentation
- Celebrate
Total time: 14 weeks. Zero downtime. No disasters.
Common mistakes
Making phases too big. If a phase takes more than 4 weeks, it's too big. Break it down.
Not testing rollback. You don't want to figure out how to roll back at 3 AM when production is down.
Skipping the pilot. The first migration teaches you everything. Don't skip it.
Moving too fast. Speed is good, but not at the expense of safety. Give each phase time to stabilize.
Not documenting lessons learned. What you learn in Wave 1 should improve Wave 2. Write it down.
Ignoring dependencies. You can't migrate the app before the database. Map it out first.
What you should have for each phase
Before the phase:
- List of what's being migrated
- Dependencies mapped out
- Rollback plan documented
- Success criteria defined
- Team knows their roles
During the phase:
- Monitoring and alerts set up
- Communication plan (who to notify, when)
- Go/no-go decision points
- Rollback trigger conditions
After the phase:
- Validation that everything works
- Performance comparison (old vs. new)
- Lessons learned documented
- Cleanup tasks completed
The golden rule
If you can't roll back in under an hour, your phase is too big.
Break it down. Make it smaller. Reduce the risk.