Phase 1 is never 'lift and shift.' Here's the framework that keeps production stable throughout the entire move.
After leading migrations for organizations ranging from 200 to 4,000 engineers, I've distilled it into 6 phases that keep production alive and teams sane.
Phase 1: Discovery & Dependency Mapping
Before touching a single VM, audit everything.
Inventory all services: apps, databases, middleware, integrations
Map inter-service dependencies, including the undocumented ones (ask the engineers who've been there longest)
Tag everything by criticality, data sensitivity, migration complexity
Identify the "spiders in the web": the services everything else depends on
Tools: AWS Application Discovery Service, Cloudamize, or a structured spreadsheet + interviews.
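Once the dependency map exists, even as a spreadsheet export, it can be queried directly. Here is a minimal sketch, assuming a hypothetical `deps` dictionary in which each service lists the services it calls. The service names are made up for illustration. It counts how many services depend on each one, directly or transitively, to surface the "spiders in the web":

```python
from collections import defaultdict

# Hypothetical inventory: service -> services it calls directly.
deps = {
    "web": ["auth", "orders"],
    "orders": ["auth", "billing-db"],
    "reports": ["billing-db"],
    "auth": ["user-db"],
    "billing-db": [],
    "user-db": [],
}


def dependents(deps):
    """Map each service to the set of services that depend on it,
    directly or transitively."""
    # Invert the edges: callee -> direct callers.
    reverse = defaultdict(set)
    for svc, callees in deps.items():
        for callee in callees:
            reverse[callee].add(svc)

    result = {}
    for svc in deps:
        seen, stack = set(), [svc]
        # Walk "upstream" through callers; `seen` also guards against cycles.
        while stack:
            for caller in reverse[stack.pop()]:
                if caller not in seen:
                    seen.add(caller)
                    stack.append(caller)
        result[svc] = seen
    return result


def spiders(deps, top=3):
    """Return the `top` services with the most dependents, as (name, count)."""
    ranked = sorted(dependents(deps).items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(svc, len(users)) for svc, users in ranked[:top]]
```

Services near the top of this ranking are the ones to tag as high criticality and to schedule with the most care.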
Phase 2: Cloud Foundation & Landing Zone
Build the platform before you migrate anything.
Set up your VPC architecture (hub-and-spoke or flat; decide now)
Implement IAM roles, SCPs, and guardrails
Deploy centralized logging, monitoring, and alerting
Establish your IaC standard (Terraform modules, Pulumi stacks — standardize before day one)
Set cost budgets and alerts BEFORE anything runs
Never skip this. Teams that migrate first and "sort out governance later" always regret it.
Phase 3: Pilot Migration (Non-Critical Workloads)
Start small. Build confidence.
Choose a non-critical, low-traffic service
Run the full migration lifecycle: move → validate → monitor → optimize
Document everything. This becomes your runbook for every subsequent migration
Identify gaps in tooling and process while the stakes are low
Phase 4: Wave-Based Migration
Organize workloads into migration waves by complexity:
Wave 1: Stateless apps (easiest)
Wave 2: Stateful apps with managed DB alternatives
Wave 3: Complex integrations and legacy services
Wave 4 (optional): Services that require refactoring before migration
Run each wave in sprints of 2–4 weeks. Include a 2-week stabilization window after each wave before proceeding.
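The wave cadence above is easy to turn into a calendar. This is a rough sketch, assuming each wave is a single sprint of 2 to 4 weeks followed by the 2-week stabilization window. The function name, wave names, and start date are hypothetical:

```python
from datetime import date, timedelta

STABILIZATION_WEEKS = 2  # from the text: 2-week window after each wave


def schedule(start, waves):
    """Lay out waves back to back.

    `waves` is a list of (name, sprint_weeks) pairs. Returns a list of
    (name, wave_start, wave_end, next_wave_earliest_start) tuples.
    """
    plan = []
    cursor = start
    for name, weeks in waves:
        # Assumption: one sprint per wave, kept inside the 2-4 week range.
        if not 2 <= weeks <= 4:
            raise ValueError(f"{name}: sprint length must be 2-4 weeks, got {weeks}")
        wave_end = cursor + timedelta(weeks=weeks)
        stable = wave_end + timedelta(weeks=STABILIZATION_WEEKS)
        plan.append((name, cursor, wave_end, stable))
        cursor = stable  # the next wave starts only after stabilization
    return plan


plan = schedule(
    date(2025, 1, 6),
    [("stateless", 2), ("stateful", 3), ("integrations", 4)],
)
for name, wave_start, wave_end, stable in plan:
    print(f"{name}: {wave_start} to {wave_end}, stable by {stable}")
```

Three waves at these lengths take 15 weeks end to end, which is worth showing stakeholders before anyone commits to a date.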
Phase 5: Cutover & Traffic Management
Zero downtime means dual running, not a big-bang cutover.
Use feature flags and DNS-based traffic shifting (Route 53 weighted routing or equivalent)
Implement a circuit breaker to instantly roll back traffic if error rates spike
Keep on-prem running in parallel for 30–60 days post-cutover
Don't decommission anything until you've run at least one full billing cycle in the cloud
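The shift-and-roll-back loop can be sketched independently of any DNS provider. In the sketch below, `set_cloud_weight` and `error_rate` are hypothetical callbacks you would wire to your own tooling (for example, a weighted-routing update and a metrics query). The step sizes and the 2% threshold are illustrative assumptions, and a simple threshold check stands in for a full circuit breaker:

```python
from typing import Callable, Sequence


def shift_traffic(
    set_cloud_weight: Callable[[int], None],
    error_rate: Callable[[], float],
    steps: Sequence[int] = (5, 25, 50, 100),
    max_error_rate: float = 0.02,
) -> int:
    """Raise the cloud share of traffic step by step.

    After each step, `error_rate()` is expected to observe the new weight
    for a soak period and return the error rate as a fraction (0.0 to 1.0).
    If it exceeds `max_error_rate`, all traffic is sent back on-prem.
    Returns the final cloud weight in percent.
    """
    current = 0
    for weight in steps:
        set_cloud_weight(weight)
        current = weight
        if error_rate() > max_error_rate:
            set_cloud_weight(0)  # roll back: on-prem takes all traffic again
            return 0
    return current
```

Because on-prem keeps running in parallel, the rollback here is just a weight change, with no redeploy involved.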
Phase 6: Optimize & Decommission
The migration isn't over when the apps are running.
Right-size instances based on real usage data (wait at least 2 weeks)
Implement autoscaling where you were running fixed capacity
Set up FinOps dashboards; cloud spend will be surprising at first
Decommission on-prem systematically, not in a rush
Run a migration retrospective with every team involved
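For the right-sizing step, a small helper can enforce the "wait at least 2 weeks" rule before suggesting anything. This is only a sketch: the input (one peak CPU percentage per day), the 60% target utilization, and the function name are assumptions for illustration, not a sizing policy from any cloud provider:

```python
MIN_DAYS = 14  # from the text: wait at least 2 weeks of real usage


def recommend_vcpus(current_vcpus, daily_peak_cpu_pct, target_pct=60):
    """Suggest a vCPU count so the observed peak lands near `target_pct`.

    `daily_peak_cpu_pct` holds one integer peak CPU percentage per day.
    Returns None while there is less than MIN_DAYS of data.
    """
    if len(daily_peak_cpu_pct) < MIN_DAYS:
        return None  # not enough real usage data yet
    peak = max(daily_peak_cpu_pct)
    # Integer ceiling division avoids floating-point rounding surprises.
    needed = -(-current_vcpus * peak // target_pct)
    return max(1, needed)
```

For example, a 16-vCPU instance that never peaks above 30% over 14 days comes out at 8 vCPUs under these assumptions. Memory, I/O, and burst behavior need the same scrutiny before resizing.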
The Rule I Never Break
No service goes to production without observability in place (logs, metrics, traces), a documented runbook, an on-call rotation assigned, and a tested rollback procedure.
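This rule works best as an automated gate rather than a checklist someone remembers. Here is a minimal sketch, with a hypothetical `Readiness` record whose fields mirror the items in the rule. How each flag gets set (CI checks, manual sign-off) is left open:

```python
from dataclasses import dataclass, fields


@dataclass
class Readiness:
    # Observability
    logs: bool = False
    metrics: bool = False
    traces: bool = False
    # Operations
    runbook_documented: bool = False
    on_call_assigned: bool = False
    rollback_tested: bool = False


def missing(r: Readiness) -> list[str]:
    """Return the names of the requirements that are not yet met."""
    return [f.name for f in fields(r) if not getattr(r, f.name)]


def ready_for_production(r: Readiness) -> bool:
    """A service passes the gate only when nothing is missing."""
    return not missing(r)
```

Calling `missing(...)` in a deployment pipeline gives teams a concrete list of what still blocks go-live.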
Save for your next migration kickoff