Sauveer Ketan

The Hidden Realities of Cloud Migration: Lessons from the Trenches

In theory, cloud migration is straightforward: you assess, plan, and execute. In practice, however, it's far more complex. While playbooks provide a solid framework, the real world often throws up situations that demand adaptability and human intervention.

The Textbook View: Assess, Mobilize, Migrate

AWS's migration playbook outlines three distinct phases:

  • Assess: Understand your current environment and build a compelling business case for migration.
  • Mobilize: Prepare your AWS foundation, define your architecture, and finalize your migration plan.
  • Migrate: Execute the move of your workloads and data to AWS. In this phase, migration is divided into two stages: initialize and implement.

Phases of a large Migration

This structured approach sounds simple enough. Yet, once you start planning, you quickly realize that cloud migration success isn't just about technology; it's equally about people, processes, and preparedness.

When service providers present their proposals to clients, this difference in exposure often leads to heated discussions. Clients with no prior migration experience tend to downplay the role of "known unknowns" and "unknown unknowns" and push for shorter timelines with fewer people, while service providers draw on their past migrations to build allowances for these unknowns into the plan.

Tools

AWS provides a robust suite of tools to streamline cloud migrations, covering the discovery, planning, and execution phases. Migration Evaluator, used in the initial Assess phase, is crucial for building a data-driven business case. AWS Migration Hub acts as a central console, offering a unified view of your migration progress and integrating with various services. For detailed assessment and dependency mapping, AWS Application Discovery Service helps gather crucial information about on-premises servers, applications, and their interdependencies. For the actual "lift-and-shift" of servers and virtual machines, AWS Application Migration Service (AWS MGN) is the go-to solution, automating server replication and cutover with minimal downtime, which makes it efficient for migrating diverse workloads, including legacy systems.

The Cloud Migration Factory on AWS solution is designed to coordinate and automate manual processes for large-scale migrations involving a substantial number of applications. This solution helps enterprises improve performance and prevents long cutover windows by providing an orchestration platform for migrating workloads to AWS at scale.

AWS Transform for VMware is a newer service in the toolkit. It is an agentic AI service that automates application discovery and dependency mapping, network translation, wave planning, and server migration, while optimizing EC2 instance selection, to accelerate the migration of VMware workloads.

While AWS provides a robust suite of native tools, the broader cloud migration ecosystem also includes a variety of AWS Partner solutions; Cloudamize, modelizeIT, Flexera, and RiverMeadow are a few of them.

What Really Happens: Real-World Cloud Migration

Despite rigorous preparation, detailed runbooks, and sophisticated migration tooling, we consistently ran into challenges. Here are a few concrete lessons learned from actual cloud migrations (many of them), issues that only became clear through firsthand experience. My focus here is on rehost (lift-and-shift) migrations only.

1. Gaps in Dependency Mapping

Even with advanced automated discovery tools and thorough application team walkthroughs, some critical interdependencies inevitably slipped through the cracks. These hidden connections became glaringly apparent only during the high-pressure cutover windows.

Lesson: Always supplement automated discovery tool outputs with detailed, in-depth application interviews and meticulous system-level dependency validation.
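
As one way to do that system-level validation, a quick check run on each source server can list its established outbound connections so they can be reconciled against the discovery tool's dependency map. The sketch below is only illustrative and assumes the psutil library is available:

```python
# Minimal sketch: list established TCP connections on a source server so
# they can be reconciled against the dependency map from discovery tooling.
from collections import Counter

import psutil  # third-party; pip install psutil

remote_peers = Counter()
for conn in psutil.net_connections(kind="tcp"):
    # Keep only live connections that have a remote endpoint.
    if conn.status == psutil.CONN_ESTABLISHED and conn.raddr:
        remote_peers[(conn.raddr.ip, conn.raddr.port)] += 1

for (ip, port), count in remote_peers.most_common():
    print(f"{ip}:{port}  ({count} connection(s))")
```

Any peer that does not show up in the dependency map is a candidate for a follow-up interview with the application team.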

2. Overprovisioning and Cost Optimization

A common initial misstep was placing all production databases on expensive io2 volumes, based on an assumption of high IOPS needs. In reality, most systems didn't require such high performance.

Mid-migration, we shifted our default storage strategy to gp3. Post-migration, we diligently monitored actual IOPS metrics and only upgraded volumes where necessary. We also planned for and converted the existing io2 volumes to gp3, which is straightforward in AWS.
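
As a rough illustration of that workflow, the sketch below reads peak EBS IOPS from CloudWatch and converts an io2 volume to gp3 when usage sits comfortably under the gp3 baseline. The volume ID and thresholds are placeholders, and it assumes boto3 with appropriate credentials:

```python
# Sketch: convert an io2 volume to gp3 if its recent peak IOPS are well
# below the gp3 baseline of 3,000 IOPS. IDs and thresholds are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")

def peak_iops(volume_id, days=14):
    """Approximate peak IOPS over the last `days` days from CloudWatch."""
    end = datetime.now(timezone.utc)
    total = 0.0
    for metric in ("VolumeReadOps", "VolumeWriteOps"):
        stats = cw.get_metric_statistics(
            Namespace="AWS/EBS",
            MetricName=metric,
            Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
            StartTime=end - timedelta(days=days),
            EndTime=end,
            Period=300,                      # operations per 5-minute window
            Statistics=["Maximum"],
        )
        points = [p["Maximum"] for p in stats["Datapoints"]]
        total += max(points, default=0.0) / 300   # convert to ops/second
    return total

volume_id = "vol-0123456789abcdef0"          # placeholder
if peak_iops(volume_id) < 3000:              # gp3 baseline IOPS
    ec2.modify_volume(VolumeId=volume_id, VolumeType="gp3")
```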

Along similar lines, application teams sometimes want as much CPU and RAM as they had on-premises, fearing application performance degradation. Sizing should instead follow the rightsizing recommendations of the assessment tools, and the decision should not be left solely to application teams (this requires senior stakeholders' buy-in and support). Resizing EC2 instances post-migration is quick and easy if it turns out to be needed.
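
If a resize does become necessary after cutover, it is little more than a stop, attribute change, and start. A minimal sketch with placeholder IDs and an assumed target instance type:

```python
# Sketch: downsize an EC2 instance once monitoring confirms it is oversized.
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"          # placeholder

ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
ec2.modify_instance_attribute(InstanceId=instance_id,
                              InstanceType={"Value": "m6i.large"})  # assumed target size
ec2.start_instances(InstanceIds=[instance_id])
```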

Recommendation: Baseline your IOPS needs using real metrics and avoid making assumptions about storage requirements. Rightsize EC2 instances based on the assessment tool recommendations.

3. COTS Applications Can Be Complicated

Commercial Off-the-Shelf (COTS) applications often introduce unique hurdles:

  • Some had unsupported licensing models in AWS.
  • Others demanded minimum CPU or RAM allocations and refused to run on the smaller EC2 sizes recommended by the assessment tools, despite low actual utilization.
  • Certain applications, like Tableau, could not be lifted and shifted directly due to architectural or licensing constraints.

Takeaway: Thoroughly review vendor support statements and validate technical feasibility with Proofs of Concept (PoCs) early in the project lifecycle. Discuss the plan with the vendor and engage them for the migration window. Involve application teams during test cutovers; full-fledged testing might not be possible at this stage, but see whether some sanity tests can be performed.

4. Unexpected Machine Password Resets on Windows

A subtle yet disruptive issue involved monthly Windows machine password resets. Systems failed to join Active Directory during migration if their machine account password changed during the cutover window.

Fix: Implement pre-checks for password age and force resets before migration where necessary to ensure smooth domain joining.
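
One way to implement such a pre-check, shown here as a hedged sketch that assumes the ldap3 library and placeholder domain details, is to query Active Directory for computer accounts whose machine password is approaching the default 30-day rotation:

```python
# Sketch: flag AD computer accounts whose machine password is close to the
# default 30-day rotation so they can be reset before the cutover window.
# Server, credentials, and OU below are placeholders.
from datetime import datetime, timedelta, timezone

from ldap3 import ALL, Connection, Server

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def pwd_last_set(value):
    """pwdLastSet may arrive pre-parsed as a datetime or as a raw FILETIME int."""
    if isinstance(value, datetime):
        return value
    return FILETIME_EPOCH + timedelta(microseconds=int(value) / 10)

conn = Connection(Server("dc01.example.com", get_info=ALL),
                  user="EXAMPLE\\svc-migration", password="***", auto_bind=True)
conn.search("OU=Servers,DC=example,DC=com", "(objectClass=computer)",
            attributes=["cn", "pwdLastSet"])

for entry in conn.entries:
    age = datetime.now(timezone.utc) - pwd_last_set(entry.pwdLastSet.value)
    if age > timedelta(days=25):             # close to the 30-day rotation
        print(f"Reset machine password before cutover: {entry.cn.value}")
```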

5. Third-Party Tooling and Licensing

Many organizations rely on various third-party tools (monitoring, security, etc.). Tools crucial for post-migration access and verification, such as BeyondTrust PowerBroker, often had a limited number of licenses, and reassignments were required. This bottleneck caused significant delays in validation efforts.

Lesson: Proactively align tool licensing with peak migration activity requirements to prevent unexpected hold-ups.

6. fstab Issues in Linux Systems

Stale fstab entries, such as references to decommissioned NFS mounts or disk Universally Unique Identifiers (UUIDs) that no longer exist, can lead to boot failures in Linux systems. In some cases, manual intervention via rescue mode was required.

Recommendation: Gracefully reboot servers a few days prior to cutover. This simple step can surface latent boot-time issues before they impact your migration window.
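
A small pre-cutover script can also catch the most common offenders. This sketch, with illustrative paths and options, flags NFS entries that lack nofail or _netdev and UUID references with no matching device on the system:

```python
# Sketch: flag fstab entries likely to block boot after migration.
import os

def check_fstab(path="/etc/fstab"):
    known_uuids = set(os.listdir("/dev/disk/by-uuid"))
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            fields = line.split()
            if len(fields) < 4:
                continue
            device, mountpoint, fstype, options = fields[:4]
            opts = options.split(",")
            if fstype.startswith("nfs") and "nofail" not in opts and "_netdev" not in opts:
                print(f"NFS mount {mountpoint} may hang boot (consider nofail/_netdev)")
            if device.startswith("UUID=") and device[5:] not in known_uuids:
                print(f"{mountpoint}: UUID {device[5:]} not found on this system")

if __name__ == "__main__":
    check_fstab()
```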

7. On-Premise NFS Mounts Introduced Latency

We discovered that some applications, after migration, were referencing static content over Network File System (NFS) mounts from on-premise NFS servers. This introduced significant latency, impacting performance.

These mounts had to be migrated to AWS services like Amazon EFS or FSx. AWS Storage Gateway is also an option for hybrid cloud scenarios where some on-premise data might still need to be accessed. In some instances, we needed to build fallback mechanisms directly into the applications. In other cases, applications had to be rolled back, to be migrated later.

Recommendation: This takes us back to the first point above, i.e., dependency mapping, plus proper testing before migration where needed.

8. Anti-Virus Interference with NFS

In one migration, a perplexing performance issue arose when anti-virus software was found to be scanning NFS-mounted directories. While ping and traceroute showed no network issues, application performance dropped dramatically: whenever anything was uploaded to the NFS server, the anti-virus software scanned it. The servers running the anti-virus software were low on memory, and a graceful reboot fixed the issue. It took a few hours and multiple people to pin down the cause, as no single person had access to all the components involved. This is an excellent example of "unknown unknowns."

Mitigation: The support team incorporated monthly anti-virus server reboots into their playbook to alleviate this subtle yet impactful problem. Similar measures are useful for other centralized tooling servers.

9. Legacy Systems Need Special Handling

Older systems, such as Windows 2008 servers, RHEL 5, etc., presented unique challenges. They could only be migrated to older Xen-based instance types and hence had fewer EC2 options. In a very special case, we migrated a Windows 2003 server with 1 GB of RAM. Support for these legacy operating systems was limited, and successfully migrating them sometimes took multiple attempts. Some configuration changes might also be required, for example installing ENA drivers on RHEL 6 servers before placing them on Nitro instances; missing this step leads to issues and troubleshooting.
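
For the ENA example, the EC2-side flag can be checked and enabled with a short script like the one below; the instance ID is a placeholder, and the guest OS still needs the ena kernel module installed separately:

```python
# Sketch: ensure the enaSupport attribute is set before moving an instance
# to a Nitro-based instance type. The instance must be stopped to change it.
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"          # placeholder

attr = ec2.describe_instance_attribute(InstanceId=instance_id, Attribute="enaSupport")
if not attr.get("EnaSupport", {}).get("Value", False):
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(InstanceId=instance_id, EnaSupport={"Value": True})
    ec2.start_instances(InstanceIds=[instance_id])
```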

If an OS upgrade is an option, consider it during migration planning rather than relying on a legacy OS.

10. Decommission

In one migration, after rigorous assessment, it turned out that out of around 500 servers, around 80 could be decommissioned, leading to huge savings. Here is an interesting idea: even if you are not migrating to the cloud in the near future, why not run a full-fledged assessment periodically to uncover such wasted resources?

Retire is one of the 7 Rs of migration (Retire, Retain, Rehost, Relocate, Replatform, Repurchase, and Refactor) and a very important one.

Operational & Technical Observations

Beyond specific application-level issues, several broader operational and technical challenges emerged:

  • AWS Limits: In a few migrations, we were moving hundreds of servers per week. We frequently encountered AWS service limits, including those for snapshots, API calls, and AWS MGN (or CloudEndure before it). This necessitated requesting additional accounts and quota increases. Plan for these in advance and get them raised beforehand.

  • Disk Attachment Limits: Some source systems exceeded EC2 disk attachment limits, requiring architectural restructuring to accommodate them. This is an edge case, but such systems are often critical ones. Treat it as a key consideration during source assessment and target architecture design in the Mobilize phase.

  • Oracle ASM FD: We had multiple Oracle ASM FD servers. During one migration, troubleshooting with AWS revealed that while Oracle ASM (Automatic Storage Management) was supported by CloudEndure, ASM FD (ASM Filter Driver) was not. AWS provided amazing support and delivered a fix after a few weeks, and we were able to migrate these servers successfully.

  • F5 iRules and ALB: Load balancer behaviors differed between on-premises F5 iRules and AWS ALB. During planning it became clear that this would require refactoring some applications. One example is client IP handling: a few applications needed the client IP directly, which worked fine with the on-prem F5 using direct pass-through, but the ALB places it in the X-Forwarded-For header (see the sketch after this list). These kinds of scenarios provide an opportunity to adopt other cloud-native features and modernize the applications.

  • Hardcoded IPs: A common problem, especially in development environments, was hardcoded IP addresses within TLS certificates and application configurations, complicating the migration process. These were often noticed only after the migration.

  • ENI Pre-Provisioning: In specific scenarios, Elastic Network Interfaces (ENIs) had to be pre-created and preserved so that static IPs were known beforehand and could be configured in firewalls and load balancers to avoid downtime. Missing this step can force a backout of the applications.
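
A minimal sketch of that ENI pre-provisioning, with placeholder subnet, security group, and IP values, might look like this:

```python
# Sketch: reserve an ENI with a known private IP ahead of cutover so firewall
# and load balancer rules can be prepared in advance. All IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")
eni = ec2.create_network_interface(
    SubnetId="subnet-0123456789abcdef0",
    PrivateIpAddress="10.20.30.40",
    Groups=["sg-0123456789abcdef0"],
    Description="app01 - reserved for migration wave",
)
print("Reserved ENI:", eni["NetworkInterface"]["NetworkInterfaceId"])
```

And for the F5-to-ALB point above, applications that previously saw the client IP directly typically need a small change to read the left-most entry of X-Forwarded-For instead. A hypothetical helper, framework-agnostic and purely illustrative:

```python
# Sketch: behind an ALB the socket peer is the load balancer, and the original
# client IP arrives as the left-most entry of X-Forwarded-For.
def client_ip(headers, remote_addr):
    forwarded = headers.get("X-Forwarded-For", "")
    if forwarded:
        return forwarded.split(",")[0].strip()
    return remote_addr

# Example: behind an ALB, 203.0.113.7 is the real client.
print(client_ip({"X-Forwarded-For": "203.0.113.7, 10.0.0.5"}, "10.0.0.5"))
```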

Process and Collaboration Insights

Effective process and strong collaboration were vital to navigating these complexities:

  • Weekly Lessons Learned (LL) calls: We instituted weekly LL calls every Tuesday post-migration to review and capture insights.
  • Centralized Documentation: All issues, along with their resolutions, were meticulously documented and stored in a central SharePoint portal for easy access and reuse.
  • Mandatory Review: All migration engineers were required to read, contribute to, and reuse this living documentation, and were randomly asked to give walkthroughs of it.
  • AWS TAM Coordination: Close coordination with AWS Technical Account Managers (TAMs) proved invaluable in resolving roadblocks and accelerating issue resolution.
  • Runbook Updates: Runbooks were continuously updated after each migration wave, incorporating real-world field feedback.
  • Roster Updates: Multiple teams need to be available during the migration window: various infra support teams, application teams, vendor support for COTS applications, etc. Based on lessons learned, we rigorously updated our rosters and confirmed each team's engagement.

Recommendations for Future Migrations

Based on our experiences, here are key recommendations for any organization embarking on a cloud migration journey:

  • Expect Surprises: Always anticipate the unexpected, especially when dealing with legacy configurations and potential human errors.
  • Create Buffer Bandwidth: Build in buffer capacity for both your engineering team and your project schedule to absorb unforeseen challenges. A single issue can tie up one of your engineers for hours, even though the plan had her handling 10 servers during that migration window!
  • Make Graceful Pre-Migration Reboots Standard: Implement pre-migration reboots as a standard procedure to surface latent boot-time issues before they impact your cutover window.
  • Validate Tools and Licensing: Thoroughly validate all tools and their licensing requirements for both pre- and post-migration activities.
  • Document Every Issue: Treat every encountered issue and its fix as a critical piece of your living migration playbook. Document everything diligently, hold lessons-learned discussions after every wave, and update your runbooks and rosters accordingly.

Final Thoughts

Cloud migration is more than just a technological shift; it's a comprehensive change management initiative that impacts systems, teams, and deeply ingrained assumptions. No matter how meticulously you plan, real-world migrations will invariably expose issues that only practical experience can help you resolve.

The key to a successful migration lies in your ability to focus on learning, diligent documentation, and constant adaptation. That's what truly transforms a good migration strategy into a great one.

What are some unexpected challenges you've faced during cloud migrations, and how did you overcome them?
