Moving a production-grade application serving millions of daily users from one cloud provider to another is a high-stakes operation. Having executed a complete GCP-to-AWS migration covering 21TB of data, MongoDB replica sets, MySQL clusters, and Apache Solr search infrastructure, I came away with ten critical lessons that separate successful migrations from costly disasters.
1. Hidden Dependencies Surface During Cutover Weekend
Your application architecture diagram rarely tells the complete story. During our migration, we discovered hardcoded GCP configurations buried deep in environment variables and application configs that weren't documented anywhere. Even with the right tools, coordinating syncs across staging, pre-prod, and production environments demanded careful orchestration and monitoring. Network monitoring in the weeks before cutover revealed communication patterns we had missed, including internal DNS dependencies that required AWS Route 53 private hosted zones to replicate GCP's automatic internal DNS resolution.
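As a minimal sketch of that last point, a Route 53 private hosted zone associated with the target VPC can stand in for GCP's built-in internal DNS. The zone name, VPC ID, zone ID, and record values below are placeholders, not our actual configuration:

```shell
# Create a private hosted zone so instances in the VPC resolve the same
# internal names they used in GCP (all identifiers are illustrative).
aws route53 create-hosted-zone \
  --name internal.example.com \
  --caller-reference "migration-$(date +%s)" \
  --vpc VPCRegion=us-east-1,VPCId=vpc-0abc1234567890def \
  --hosted-zone-config Comment="Internal service DNS",PrivateZone=true

# Point an internal name at a private IP (UPSERT is idempotent, which
# helps when re-running cutover scripts).
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "mysql-primary.internal.example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "10.0.1.15"}]
      }
    }]
  }'
```

A low TTL like the one above keeps a rollback-driven DNS flip fast during the cutover window.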
2. Data Transfer Costs Are a Hidden Budget Killer
Data Transfer Out from GCP was a substantial cost center that never appeared in our early budgets. Our 21TB migration taught us that egress charges can dramatically exceed initial estimates. We saved thousands of dollars by identifying GCS buckets containing temporary compliance logs with auto-expiry policies and letting them expire in GCP rather than migrating unnecessary data. Plan for egress costs early and audit your data to confirm what actually needs to move.
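A quick back-of-envelope calculation makes the point concrete. The rate below is an assumed internet-egress price in USD per GB, purely for illustration; check the current GCP network pricing page for your actual tier and any discounted transfer options:

```shell
# Back-of-envelope egress estimate before committing to move data.
TB=21               # data volume to migrate
RATE_PER_GB=0.12    # assumed egress rate, USD/GB (illustrative only)

awk -v tb="$TB" -v rate="$RATE_PER_GB" \
  'BEGIN { printf "Estimated egress: $%.2f\n", tb * 1024 * rate }'
# → Estimated egress: $2580.48
```

Running this for each bucket, and subtracting anything with an auto-expiry policy, is a five-minute exercise that can reshape the migration budget.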
3. Database Replication Lag Is Your Biggest Enemy
One of our biggest blockers was replication lag during promotion drills, especially under active write-heavy workloads. Our MySQL primary/replica setup experienced significant lag during peak traffic. The solution was implementing iptables rules at the OS level to block application write traffic, allowing replication to catch up safely before cutover. This gave us a clean buffer for promotion without the risk of in-flight transactions.
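A write freeze of that shape can be sketched as below. The port, app-subnet CIDR, and hostname are assumptions; adapt them to your topology, and note that `SHOW SLAVE STATUS` is the pre-8.0.22 form (`SHOW REPLICA STATUS` on newer MySQL):

```shell
# Reject new application connections to MySQL so the replica can drain.
# 10.128.0.0/16 stands in for the app-tier subnet; adjust to taste.
iptables -I INPUT -p tcp --dport 3306 -s 10.128.0.0/16 -j REJECT

# Watch lag until it reaches 0 before promoting the new primary.
mysql -h replica.internal -e "SHOW SLAVE STATUS\G" \
  | grep Seconds_Behind_Master

# After promotion (or on rollback), delete the matching rule.
iptables -D INPUT -p tcp --dport 3306 -s 10.128.0.0/16 -j REJECT
```

Using `REJECT` rather than `DROP` makes the freeze visible to the application immediately instead of leaving connections hanging until timeout.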
4. Storage Behavior Differs Dramatically Between Clouds
Lazy loading of EBS volumes made autoscaling unreliable for time-sensitive indexing. Our Apache Solr migration revealed that EBS volumes restored from snapshots hydrate lazily, pulling blocks from S3 on first access, so data wasn't instantly usable on instance boot-up. In GCP, persistent disks mount seamlessly alongside boot disks, which made autoscaling viable. In AWS, we had to abandon autoscaling for Solr and use scheduled start/stop scripts instead. Factor in rebuild times and whether AWS Fast Snapshot Restore fits your budget.
IOPS and throughput planning was another major difference. GCP largely took care of this for us: you chose an SSD disk and increased its size when you needed more performance. In AWS, we had to plan explicitly around expected usage, especially for databases. EBS gp3 volumes require IOPS and throughput to be provisioned separately from storage capacity, so our database performance tuning became a multi-dimensional optimization problem rather than GCP's simpler disk-size-based scaling.
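The three-dimensional shape of that problem shows up directly in the CLI. The numbers below are illustrative values for a write-heavy database volume, not our production figures (gp3 supports up to 16,000 IOPS and 1,000 MB/s throughput per volume):

```shell
# gp3: size, IOPS, and throughput are three independent knobs.
aws ec2 create-volume \
  --volume-type gp3 \
  --size 500 \
  --iops 6000 \
  --throughput 500 \
  --availability-zone us-east-1a

# Performance can be retuned later without recreating the volume,
# which is handy when post-cutover metrics disagree with your plan.
aws ec2 modify-volume \
  --volume-id vol-0abc1234567890def \
  --iops 9000 --throughput 750
```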
5. Load Balancer Architectures Require Complete Rethinking
The differences between GCP's global HTTPS Load Balancer and AWS Application Load Balancer go beyond simple configuration. GCP's URL maps allowed expressive, path-based routing across services. AWS required translating these into listener rules and target groups, often resulting in more granular configurations. We moved from static public IPs to CNAME-based routing, requiring DNS strategy adjustments and SSL certificate management changes through AWS Certificate Manager.
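One GCP URL-map path rule translates into roughly one ALB listener rule plus a target group. The ARNs and the `/api/*` path below are placeholders showing the shape of that translation, not our actual values:

```shell
# Forward /api/* to a dedicated target group, mirroring a GCP URL-map
# path matcher. Lower --priority numbers are evaluated first.
aws elbv2 create-rule \
  --listener-arn "$LISTENER_ARN" \
  --priority 10 \
  --conditions Field=path-pattern,Values='/api/*' \
  --actions Type=forward,TargetGroupArn="$API_TARGET_GROUP_ARN"
```

Because each path matcher becomes its own rule with an explicit priority, a large URL map is worth scripting rather than clicking together, so the rule order stays reproducible.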
6. Security Models Force Architectural Changes
All EC2 instances (except load balancers) were placed in private subnets for enhanced security. We had to implement bastion host access, update CORS headers for CloudFront integration, and create explicit firewall rules using iptables to control MySQL access during migration. AWS's security group model required translating GCP firewall rules while adding WAF integration for DDoS protection.
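The security-group translation can be sketched by referencing source groups instead of CIDRs, which is the main idiom shift from GCP firewall rules. The group IDs below are placeholders:

```shell
# SSH into private-subnet instances only from the bastion's group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0db00000000000001 \
  --protocol tcp --port 22 \
  --source-group sg-0bastion0000000002

# MySQL reachable only from the application tier's group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0db00000000000001 \
  --protocol tcp --port 3306 \
  --source-group sg-0app000000000003
```

Group-to-group references survive instance churn and autoscaling, unlike IP-based rules.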
7. Network Restrictions Create Unexpected Blockers
AWS restricts outbound SMTP traffic on port 25 by default to prevent abuse. This was not a blocker in our GCP setup, so factor it into your cutover timeline if you're migrating mail servers. Our Postfix mail servers required explicit AWS Support requests to open port 25, adding weeks to our timeline.
Beyond port restrictions, we discovered that our heavily utilized, low-spec servers couldn't run on burstable instance types like t3 because of CPU credit limitations and network throttling. High-traffic applications that ran smoothly on GCP's custom CPU/RAM configurations suffered performance degradation when mapped to AWS burstable instances. We had to carefully analyze baseline vs. burst performance patterns and move critical workloads to dedicated instance types like m6i or c6i to avoid throttling during peak loads.
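A cheap way to validate that analysis is to watch the `CPUCreditBalance` metric on a trial instance: a balance trending toward zero means the workload is burning credits faster than the baseline earns them and will be throttled. The instance ID and time window below are placeholders:

```shell
# Hourly minimum CPU credit balance over a test day for one t3 instance.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0abc1234567890def \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-02T00:00:00Z \
  --period 3600 \
  --statistics Minimum
```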
8. Rollback Plans Must Account for Cloud-Specific Behaviors
Beyond the write freezes described earlier, our rollback strategy included hybrid MongoDB nodes that acted as bridges between cloud environments. The key was testing promotion and demotion scenarios multiple times, not just hoping data backups would suffice.
9. Monitoring Blind Spots Emerge in Cloud Transitions
We experienced monitoring gaps during the most critical phases when existing tools didn't translate to the new environment. Setting up CloudWatch, maintaining Nagios compatibility, and ensuring Grafana dashboards worked across both environments simultaneously was crucial. Establish baseline performance metrics in both clouds and create real-time visibility before cutover day.
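One pattern that can close such gaps is publishing the metrics you care about most, like replica lag on self-managed MySQL, as custom CloudWatch metrics, so CloudWatch alarms and Grafana dashboards share one source of truth during the transition. The hostname and namespace below are assumptions:

```shell
# Publish replica lag from a self-managed MySQL host to CloudWatch.
LAG=$(mysql -h replica.internal -e "SHOW SLAVE STATUS\G" \
  | awk '/Seconds_Behind_Master/ {print $2}')

aws cloudwatch put-metric-data \
  --namespace "Migration/MySQL" \
  --metric-name ReplicaLagSeconds \
  --value "${LAG:-0}"
```

Run from cron on each replica, this gives cutover-day dashboards in both clouds the same lag number to watch.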
10. Post-Migration Optimization Is Where Real Value Emerges
This migration tested our nerves and processes, but ultimately it left us with better observability, tighter security, and an infrastructure we could proudly call production-grade. After successful cutover, we rightsized EC2 instances using historical metrics, implemented Savings Plans for steady-state workloads, and enabled S3 lifecycle policies. The migration wasn't just about changing providers; it forced us to modernize our entire infrastructure approach.
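The lifecycle-policy piece of that cleanup can be sketched as follows; the bucket name, prefix, and retention ages are illustrative, not our production values:

```shell
# Transition logs to cheaper storage after 30 days, expire after a year.
cat > lifecycle.json <<'EOF'
{
  "Rules": [{
    "ID": "archive-then-expire-logs",
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
    "Expiration": {"Days": 365}
  }]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-app-logs \
  --lifecycle-configuration file://lifecycle.json
```

This is the S3-side mirror of the GCS auto-expiry policies that saved us money on the way out of GCP.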
The Reality of Production Cutovers
No plan survives contact without flexibility. Our experience proved that even with meticulous planning, live production environments will surprise you. We faced MySQL promotion lags, discovered hardcoded configurations requiring emergency patches, and dealt with Solr performance issues under load. The key was having a flexible team ready to improvise while maintaining strict rollback readiness.
The most important lesson? Migration is more than lift-and-shift; it's evolve or expire. Successful cloud migration requires embracing architectural differences rather than fighting them. Plan thoroughly, test relentlessly, and be prepared to adapt quickly when reality differs from your runbook.
Your cloud migration is an opportunity to build better infrastructure, not just move existing problems to a new provider. Approach it as a chance to evolve your entire operational model, and you'll emerge stronger on the other side.