DEV Community

Thesius Code
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Cloud Migration Playbook

Cloud Migration Playbook

A structured, phase-based framework for migrating workloads to the cloud with confidence. This playbook covers the complete migration lifecycle — from initial portfolio assessment and dependency mapping through wave planning, execution, validation, and rollback procedures. Each phase includes decision templates, checklists, Terraform scaffolding for landing zones, and scripts for cutover automation. Built for teams who need a repeatable process, not just a high-level slide deck.

Key Features

  • 6R Classification Tool — Decision matrix for categorizing each workload as Rehost, Replatform, Refactor, Repurchase, Retire, or Retain
  • Dependency Mapper — Scripts and templates for discovering application dependencies, network flows, and data relationships
  • Wave Planning Templates — Spreadsheet-based wave organizer grouping workloads by dependency, risk, and business criticality
  • Landing Zone Scaffolding — Terraform modules for provisioning target accounts, VPCs, and security baselines pre-migration
  • Cutover Runbooks — Step-by-step procedures with parallel tracks for database migration, DNS cutover, and application switchover
  • Rollback Procedures — Pre-tested rollback scripts with decision criteria for when to abort a migration wave
  • Validation Framework — Automated smoke tests, performance benchmarks, and data integrity checks post-migration
  • Stakeholder Templates — Communication plans, status reports, and go/no-go decision documents

Quick Start

# Step 1: Run the portfolio assessment
python3 src/assessment/portfolio_scanner.py \
  --inventory inventory.csv \
  --output-dir reports/assessment/

# Step 2: Generate wave plan
python3 src/planning/wave_planner.py \
  --assessment reports/assessment/classified.json \
  --max-wave-size 10 \
  --output reports/wave-plan.json

# Step 3: Deploy landing zone for wave 1
cd src/terraform/landing-zone
terraform init
terraform apply -var="wave=1"
Enter fullscreen mode Exit fullscreen mode

Architecture

┌──────────────────────────────────────────────────────────┐
│                  Migration Lifecycle                     │
│                                                          │
│  Phase 1         Phase 2        Phase 3       Phase 4    │
│  ASSESS          PLAN           EXECUTE       VALIDATE   │
│  ┌─────────┐    ┌─────────┐   ┌─────────┐  ┌─────────┐ │
│  │Discover │    │Wave     │   │Provision│  │Smoke    │ │
│  │Classify │───►│Planning │──►│Migrate  │─►│Test     │ │
│  │Baseline │    │Runbooks │   │Cutover  │  │Perf.    │ │
│  │Deps Map │    │Rollback │   │DNS Swap │  │Data OK  │ │
│  └─────────┘    └─────────┘   └────┬────┘  └─────────┘ │
│                                    │                     │
│                              ┌─────▼─────┐               │
│                              │ Rollback? │               │
│                              │  YES: Run │               │
│                              │  rollback │               │
│                              │  script   │               │
│                              └───────────┘               │
└──────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

Usage Examples

6R Classification Logic

# src/assessment/classifier.py
from dataclasses import dataclass
from enum import Enum

class MigrationStrategy(Enum):
    REHOST = "rehost"           # Lift-and-shift to cloud VMs
    REPLATFORM = "replatform"   # Minor optimization (e.g., managed DB)
    REFACTOR = "refactor"       # Re-architect for cloud-native
    REPURCHASE = "repurchase"   # Replace with SaaS
    RETIRE = "retire"           # Decommission
    RETAIN = "retain"           # Keep on-premises

@dataclass
class WorkloadAssessment:
    name: str
    strategy: MigrationStrategy
    complexity: str             # low, medium, high
    business_criticality: str   # low, medium, high, critical
    estimated_effort_days: int
    dependencies: list[str]

def classify_workload(
    age_years: int,
    has_saas_alternative: bool,
    monthly_users: int,
    technical_debt_score: float,  # 0.0 to 1.0
) -> MigrationStrategy:
    """Recommend migration strategy based on workload characteristics."""
    if monthly_users == 0:
        return MigrationStrategy.RETIRE
    if has_saas_alternative and technical_debt_score > 0.7:
        return MigrationStrategy.REPURCHASE
    if technical_debt_score > 0.8:
        return MigrationStrategy.REFACTOR
    if age_years > 10:
        return MigrationStrategy.REPLATFORM
    return MigrationStrategy.REHOST
Enter fullscreen mode Exit fullscreen mode

Landing Zone Terraform Module

# src/terraform/landing-zone/main.tf
module "vpc" {
  source  = "./modules/vpc"
  cidr    = var.vpc_cidr
  azs     = var.availability_zones
  project = var.project_name
  env     = "migration-wave-${var.wave}"
}

module "security_baseline" {
  source              = "./modules/security"
  vpc_id              = module.vpc.vpc_id
  enable_flow_logs    = true
  enable_guardduty    = true
  log_retention_days  = 90
}

module "migration_staging" {
  source            = "./modules/staging"
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.private_subnet_ids
  # Temporary resources for migration — destroyed after validation
  instance_type     = "m5.xlarge"
  replication_agent = true
}
Enter fullscreen mode Exit fullscreen mode

Cutover Checklist as Code

# configs/cutover-checklist.yaml
pre_cutover:
  - name: Freeze code deployments
    owner: dev-team
    verification: "No open PRs targeting main branch"
  - name: Final data sync
    owner: dba-team
    verification: "Replication lag < 1 second"
  - name: Notify stakeholders
    owner: project-lead
    verification: "Slack message sent to #migration channel"

cutover:
  - name: Stop writes to source database
    command: "python3 scripts/freeze_source_db.py"
    rollback: "python3 scripts/unfreeze_source_db.py"
  - name: Final replication catch-up
    command: "python3 scripts/wait_for_sync.py --max-wait 300"
    timeout_seconds: 600
  - name: Switch DNS to cloud endpoint
    command: "python3 scripts/dns_cutover.py --target cloud"
    rollback: "python3 scripts/dns_cutover.py --target onprem"

post_cutover:
  - name: Run smoke tests
    command: "pytest tests/smoke/ -v --timeout=120"
    required: true
  - name: Monitor error rates for 30 minutes
    command: "python3 scripts/monitor_errors.py --duration 1800"
    required: true
Enter fullscreen mode Exit fullscreen mode

Configuration

# configs/migration-config.yaml
project_name: acme-cloud-migration
target_cloud: aws                  # aws, azure, or gcp
target_region: us-east-1

waves:
  max_workloads_per_wave: 10       # Keep waves small and manageable
  min_days_between_waves: 7        # Buffer for issue resolution
  blackout_dates:                  # No migrations during these periods
    - "2026-03-25/2026-03-31"      # End of quarter
    - "2026-12-20/2027-01-05"      # Holiday freeze

rollback:
  auto_rollback_on_smoke_failure: true
  max_rollback_window_hours: 4     # After this, rollback becomes risky
  keep_source_running_days: 14     # Source stays active post-cutover

notifications:
  slack_webhook: YOUR_SLACK_WEBHOOK_HERE
  email: migration-team@example.com
Enter fullscreen mode Exit fullscreen mode

Best Practices

  • Assess everything before migrating anything — Full portfolio discovery prevents surprises in later waves
  • Start with low-risk workloads — Build team confidence and refine processes before touching critical systems
  • Keep wave sizes small — 5-10 workloads per wave; larger waves increase blast radius if something goes wrong
  • Test rollback before you need it — Run a practice rollback during wave 1 even if the migration succeeds
  • Maintain source in parallel — Keep on-prem systems running for at least 2 weeks post-cutover as a safety net
  • Track dependencies in a graph — Spreadsheets fail at showing transitive dependencies; use a proper dependency map

Troubleshooting

Issue Cause Fix
Replication agent can't connect to target Security group missing ingress rule for replication port Add TCP 443/1500 inbound from source IP range
Data integrity check fails post-migration Writes occurred during cutover window Re-run final sync; ensure source was frozen before cutover
DNS propagation delayed Long TTL on existing DNS records Lower TTL to 60s at least 24h before cutover
Application errors after cutover Hardcoded IP addresses in application config Search configs for source IPs; replace with cloud endpoints or DNS names

This is 1 of 11 resources in the Cloud Architecture Pro toolkit. Get the complete [Cloud Migration Playbook] with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire Cloud Architecture Pro bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →


Related Articles

Top comments (0)