DEV Community

varun varde


Zero Downtime Cloud Migration: The 6-Phase Playbook

Phase 1 is never 'lift and shift.' Here's the framework that keeps production stable throughout the entire move.

After leading migrations for organizations ranging from 200 to 4,000 engineers, I've distilled the process into 6 phases that keep production alive and teams sane.

Phase 1: Discovery & Dependency Mapping
Before touching a single VM, audit everything.

  • Inventory all services: apps, databases, middleware, integrations

  • Map inter-service dependencies, including the undocumented ones (ask the engineers who've been there longest)

  • Tag everything by criticality, data sensitivity, and migration complexity

  • Identify the "spiders in the web": the services everything else depends on

Tools: AWS Application Discovery Service, Cloudamize, or a structured spreadsheet + interviews.
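
Once the inventory exists, finding the "spiders in the web" is a graph question. Here is a minimal sketch, assuming you have exported dependencies into a simple mapping; the service names and the `DEPENDS_ON` structure are made up for illustration and are not the output format of any discovery tool.

```python
from collections import defaultdict

# Hypothetical inventory: each service maps to the services it calls directly.
DEPENDS_ON = {
    "web": ["auth", "orders"],
    "orders": ["auth", "postgres"],
    "auth": ["postgres"],
    "reports": ["postgres"],
    "postgres": [],
}

def transitive_dependents(depends_on):
    """Return {service: set of services that depend on it, directly or indirectly}."""
    # Invert the edges: for each service, who calls it directly?
    direct = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            direct[dep].add(service)

    # Include services that only ever appear as a dependency (often the undocumented ones).
    all_services = set(depends_on) | set(direct)

    result = {}
    for service in all_services:
        seen = set()
        stack = list(direct[service])
        while stack:
            current = stack.pop()
            if current not in seen:  # the seen set also protects against cycles
                seen.add(current)
                stack.extend(direct[current])
        result[service] = seen
    return result

def rank_by_blast_radius(depends_on):
    """Services with the most transitive dependents come first; ties sort by name."""
    dependents = transitive_dependents(depends_on)
    return sorted(dependents, key=lambda s: (-len(dependents[s]), s))

ranking = rank_by_blast_radius(DEPENDS_ON)
print(ranking)  # ['postgres', 'auth', 'orders', 'reports', 'web']
```

Services at the top of the ranking are the ones whose move touches the most other teams, so they deserve the most planning.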

Phase 2: Cloud Foundation & Landing Zone
Build the platform before you migrate anything.

  • Set up your VPC architecture (hub-and-spoke or flat; decide now)

  • Implement IAM roles, SCPs, and guardrails

  • Deploy centralized logging, monitoring, and alerting

  • Establish your IaC standard (Terraform modules or Pulumi stacks; standardize before day one)

  • Set cost budgets and alerts BEFORE anything runs

Never skip this. Teams that migrate first and "sort out governance later" always regret it.
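
To make the budget point concrete, here is a rough sketch of the logic behind a budget alert: check which thresholds month-to-date spend has crossed, and project month-end spend with a naive linear forecast. The function name, thresholds, and numbers are assumptions for illustration; in practice your cloud provider's budgeting service would do this for you.

```python
import calendar
from datetime import date

def budget_alerts(monthly_budget, spend_to_date, today, thresholds=(0.5, 0.8, 1.0)):
    """Report crossed budget thresholds and a straight-line month-end forecast."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    # Naive assumption: spend continues at the same average daily rate.
    forecast = spend_to_date / today.day * days_in_month
    crossed = [t for t in thresholds if spend_to_date >= t * monthly_budget]
    return {
        "forecast": round(forecast, 2),
        "crossed": crossed,
        "forecast_over_budget": forecast > monthly_budget,
    }

# Hypothetical numbers: 10,000 budget, 6,000 spent by June 15.
result = budget_alerts(10_000, 6_000, date(2024, 6, 15))
print(result)
```

With these inputs only the 50% threshold has been crossed, but the forecast already lands above budget, which is exactly the kind of early warning you want before a migration wave adds more load.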

Phase 3: Pilot Migration (Non-Critical Workloads)
Start small. Build confidence.

  • Choose a non-critical, low-traffic service

  • Run the full migration lifecycle: move → validate → monitor → optimize

  • Document everything. This becomes your runbook for every subsequent migration

  • Identify gaps in tooling and process while the stakes are low

Phase 4: Wave-Based Migration
Organize workloads into migration waves by complexity:

  • Wave 1: Stateless apps (easiest)

  • Wave 2: Stateful apps with managed DB alternatives

  • Wave 3: Complex integrations and legacy services

  • Wave 4 (optional): Services that require refactoring before migration

Run each wave in sprints of 2–4 weeks, and include a 2-week stabilization window after each wave before proceeding.
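
The wave rules above can be sketched as a small classifier plus a calendar. This is only an illustration: the `Workload` fields, the order in which the rules are checked, and the 3-week default wave length are my assumptions, not a prescribed scheme.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Workload:
    name: str
    stateless: bool
    managed_db_available: bool = False  # hypothetical flag: a managed DB alternative exists
    needs_refactoring: bool = False

def assign_wave(w):
    """Map a workload to a wave number following the wave list above."""
    if w.needs_refactoring:
        return 4  # refactor first, migrate last
    if w.stateless:
        return 1
    if w.managed_db_available:
        return 2
    return 3  # complex integrations and legacy services

def schedule(waves, start, wave_weeks=3, stabilization_weeks=2):
    """Return (wave, start, end, next_wave_start) tuples with a stabilization gap."""
    plan = []
    cursor = start
    for wave in sorted(waves):
        end = cursor + timedelta(weeks=wave_weeks)
        next_start = end + timedelta(weeks=stabilization_weeks)
        plan.append((wave, cursor, end, next_start))
        cursor = next_start
    return plan

workloads = [
    Workload("api-gateway", stateless=True),
    Workload("orders", stateless=False, managed_db_available=True),
    Workload("mainframe-bridge", stateless=False),
    Workload("billing-monolith", stateless=False, needs_refactoring=True),
]
waves = {w.name: assign_wave(w) for w in workloads}
plan = schedule(set(waves.values()), start=date(2025, 1, 6))
for wave, begins, ends, next_start in plan:
    print(f"Wave {wave}: {begins} to {ends}, next wave may start {next_start}")
```

Keeping the stabilization window in the schedule itself makes it harder to quietly skip when a deadline gets tight.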

Phase 5: Cutover & Traffic Management
Zero downtime means running both environments in parallel, not a big-bang cutover.

  • Use feature flags and DNS-based traffic shifting (Route 53 weighted routing or equivalent)

  • Implement a circuit breaker that automatically shifts traffic back if error rates spike (keep DNS TTLs low so weight changes take effect quickly)

  • Keep on-prem running in parallel for 30–60 days post-cutover

  • Don't decommission anything until you've run at least one full billing cycle in the cloud
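
Here is a minimal sketch of how weighted shifting and a circuit breaker fit together. `set_cloud_weight` and `get_error_rate` are hypothetical callables that you would back with your DNS provider's API and your monitoring system; the step sizes and the 1% error threshold are assumptions, not recommendations.

```python
def shift_traffic(set_cloud_weight, get_error_rate,
                  steps=(5, 25, 50, 100), max_error_rate=0.01):
    """Raise the cloud traffic weight step by step; roll back to 0 on an error spike.

    Returns (succeeded, history_of_weights_applied).
    """
    history = []
    for weight in steps:
        set_cloud_weight(weight)
        history.append(weight)
        # A real rollout would wait here for a bake period before reading metrics.
        if get_error_rate() > max_error_rate:
            set_cloud_weight(0)  # circuit breaker: send everything back to on-prem
            history.append(0)
            return False, history
    return True, history

# Simulated run: the error rate spikes once 50% of traffic reaches the cloud.
applied = []
rates = iter([0.001, 0.002, 0.05])
ok, history = shift_traffic(applied.append, lambda: next(rates))
print(ok, history)  # False [5, 25, 50, 0]
```

The rollback path only works because on-prem is still running, which is why the parallel-run period matters.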

Phase 6: Optimize & Decommission
The migration isn't over when the apps are running.

  • Right-size instances based on real usage data (wait until you have at least 2 weeks of it)

  • Implement autoscaling where you were running fixed capacity

  • Set up FinOps dashboards; cloud spend will be surprising at first

  • Decommission on-prem systematically, not in a rush

  • Run a migration retrospective with every team involved
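
For the right-sizing step, one common heuristic is to size for the 95th percentile of CPU usage instead of the peak or the average. The sketch below assumes CPU samples in percent, a 60% target utilization, and a made-up list of available vCPU sizes; treat all three as placeholders for your own data.

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]

def recommend_vcpus(current_vcpus, cpu_percent_samples,
                    target_percent=60, sizes=(2, 4, 8, 16)):
    """Pick the smallest size that keeps p95 CPU at or below the target utilization."""
    needed = math.ceil(current_vcpus * p95(cpu_percent_samples) / target_percent)
    for size in sizes:
        if size >= needed:
            return size
    return sizes[-1]  # nothing large enough in the list; fall back to the largest size

# Hypothetical samples from an 8-vCPU instance: mostly idle, with one outlier spike.
SAMPLES = [12, 15, 18, 20, 22, 25, 25, 26, 27, 28,
           28, 29, 29, 30, 30, 30, 30, 30, 30, 85]
recommendation = recommend_vcpus(8, SAMPLES)
print(recommendation)  # 4
```

Using a percentile keeps a single spike from dictating the instance size, but check memory and I/O the same way before downsizing.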

The Rule I Never Break
No service goes to production without observability in place (logs, metrics, traces), a documented runbook, an on-call rotation assigned, and a tested rollback procedure.
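
A rule like this is easier to enforce when it is a check, not a habit. Here is a small sketch of a go-live gate; the field names are hypothetical and would map to whatever your service catalog records.

```python
# The four requirements from the rule above, as hypothetical catalog fields.
REQUIRED = ("observability", "runbook", "on_call_rotation", "tested_rollback")

def missing_requirements(service):
    """Return the requirements a service has not met yet (empty list means ready)."""
    return [item for item in REQUIRED if not service.get(item)]

candidate = {
    "name": "orders",
    "observability": True,   # logs, metrics, and traces in place
    "runbook": True,
    "on_call_rotation": False,
    "tested_rollback": True,
}
blockers = missing_requirements(candidate)
print(blockers)  # ['on_call_rotation']
```

Wiring a check like this into the deployment pipeline turns the rule into something that cannot be skipped under deadline pressure.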

Save this for your next migration kickoff.
