
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Multi-Cloud Lakehouse Blueprint: Migration Guide — Azure-Only to Multi-Cloud Lakehouse

Datanest Digital (datanest.dev)

Overview

This guide provides a structured approach for organizations currently running Databricks
exclusively on Azure to extend their lakehouse to include AWS. It covers assessment,
planning, phased execution, validation, and cutover.


1. Migration Phases

```
Phase 0          Phase 1           Phase 2           Phase 3          Phase 4
Assessment  ──►  Foundation   ──►  Data Layer   ──►  Workloads   ──►  Operations
(2 weeks)        (4 weeks)         (4 weeks)         (4 weeks)        (Ongoing)

- Inventory      - AWS account     - Unity Catalog   - Migrate        - Monitoring
- Dependencies   - Networking      - Delta Sharing     selected       - DR drills
- Cost model     - IAM federation  - Replication       workloads      - Cost review
- Go/no-go       - Terraform       - Validation      - CI/CD          - Optimization
```

2. Phase 0: Assessment (Weeks 1–2)

2.1 Workload Inventory

Catalog all existing Azure Databricks workloads:

| Category | What to Document |
| --- | --- |
| Workspaces | Names, regions, SKU (Premium/Standard), VNet injection status |
| Clusters | Types (all-purpose, job, SQL warehouse), instance types, autoscaling configs |
| Jobs | Scheduled jobs, DLT pipelines, streaming jobs, dependencies |
| Data assets | Unity Catalog catalogs, schemas, tables, volumes, external locations |
| Users & groups | Count of users, group structure, admin accounts |
| Networking | VNet configuration, private endpoints, firewall rules |
| Secrets | Key Vault integrations, secret scopes |
| External integrations | Power BI, ADF, Synapse, Event Hubs connections |
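
In practice the cluster and job listings would come from the Databricks SDK's `WorkspaceClient` (e.g. `w.clusters.list()` and `w.jobs.list()`); the sketch below normalizes such listings into flat inventory rows, using simplified stand-in dicts rather than the SDK's actual response types.

```python
# Sketch of an inventory collector. The input shapes are simplified
# stand-ins for workspace listings, not the SDK's real response objects.
from typing import Dict, List


def build_inventory(clusters: List[dict], jobs: List[dict]) -> List[Dict[str, str]]:
    """Normalize cluster and job listings into flat inventory rows."""
    rows = []
    for c in clusters:
        rows.append({
            "category": "Clusters",
            "name": c["cluster_name"],
            "detail": f"{c['node_type_id']}, autoscale={c.get('autoscale', False)}",
        })
    for j in jobs:
        rows.append({
            "category": "Jobs",
            "name": j["name"],
            "detail": f"schedule={j.get('schedule', 'none')}",
        })
    return rows


inventory = build_inventory(
    clusters=[{"cluster_name": "etl-prod", "node_type_id": "Standard_DS4_v2",
               "autoscale": True}],
    jobs=[{"name": "nightly-aggregates", "schedule": "0 2 * * *"}],
)
```

Exporting these rows to a spreadsheet or Delta table gives stakeholders a single artifact to review during the go/no-go discussion.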

2.2 Workload Classification

Classify each workload for migration disposition:

| Disposition | Criteria | Example |
| --- | --- | --- |
| Stay on Azure | Tightly coupled to Azure services (ADF, Synapse); latency-sensitive to Azure storage | Real-time ingestion from Event Hubs |
| Extend to AWS | Benefits from multi-cloud (DR, cost optimization, regulatory) | Batch analytics, ML training |
| Migrate to AWS | Better suited for AWS (proximity to AWS data sources, team expertise) | S3-native data sources |
| Shared | Must be accessible from both clouds | Reference data, dimension tables |

2.3 Cost Modeling

Use tools/cost_comparison_model.py to estimate:

  • AWS infrastructure costs for target workloads
  • Cross-cloud networking costs (VPN, data transfer)
  • Incremental Unity Catalog licensing
  • Operational overhead (estimated person-hours for multi-cloud management)
  • Total cost of ownership comparison: Azure-only vs. multi-cloud
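
The comparison reduces to summing each scenario's infrastructure, data-transfer, and operational costs. This sketch is loosely in the spirit of tools/cost_comparison_model.py, but every figure below is a placeholder assumption, not real pricing.

```python
# Illustrative monthly TCO comparison; all dollar amounts, egress volumes,
# and hourly rates are made-up assumptions for demonstration.

def monthly_tco(compute: float, storage: float, egress_gb: float,
                egress_rate: float, ops_hours: float, hourly_rate: float) -> float:
    """Sum infrastructure, data-transfer, and operational cost per month."""
    return compute + storage + egress_gb * egress_rate + ops_hours * hourly_rate


azure_only = monthly_tco(compute=42_000, storage=6_000, egress_gb=0,
                         egress_rate=0.0, ops_hours=40, hourly_rate=95)
multi_cloud = monthly_tco(compute=35_000, storage=7_500, egress_gb=8_000,
                          egress_rate=0.09, ops_hours=70, hourly_rate=95)

# A positive delta argues for the multi-cloud extension on cost grounds alone;
# a negative one means the case must rest on DR, regulatory, or strategic drivers.
monthly_savings = azure_only - multi_cloud
```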

2.4 Go/No-Go Decision

Present findings to stakeholders. A multi-cloud extension is justified when:

  • Regulatory requirements mandate AWS-region data residency
  • DR requirements exceed what Azure-only can provide
  • Cost savings from AWS compute exceed the multi-cloud operational premium
  • Organizational acquisitions bring AWS-native workloads
  • Strategic vendor diversification is a board-level priority

3. Phase 1: Foundation (Weeks 3–6)

3.1 AWS Account Setup

  1. Create a dedicated AWS account (or use an existing one)
  2. Configure AWS Organizations with appropriate SCPs
  3. Enable CloudTrail for audit logging
  4. Configure AWS Config for compliance monitoring

3.2 Network Foundation

Deploy cross-cloud networking using terraform/aws/main.tf and the network
architecture in docs/network_architecture.md:

  1. Deploy AWS VPC with required subnets
  2. Deploy AWS Transit Gateway
  3. Configure site-to-site VPN between Azure VPN Gateway and AWS VGW
  4. Verify BGP route propagation
  5. Test connectivity from Azure Spoke VNet to AWS Workload VPC

Validation checkpoint:

```bash
# From an Azure VM in the spoke VNet
ping 10.1.16.10  # AWS workload VPC host

# From an AWS EC2 instance in the workload VPC
ping 10.0.16.10  # Azure spoke VNet host
```

3.3 Identity Federation

Configure Azure AD → AWS IAM federation per docs/identity_federation.md:

  1. Create enterprise application for AWS Databricks
  2. Configure SAML SSO
  3. Configure SCIM provisioning
  4. Test SSO login to AWS Databricks account console
  5. Verify user and group sync

Validation checkpoint:

  • User can SSO into Azure Databricks (existing)
  • Same user can SSO into AWS Databricks (new)
  • Groups appear correctly in both workspace user lists

3.4 AWS Databricks Workspace Deployment

Deploy the AWS workspace using terraform/aws/main.tf:

```bash
cd terraform/aws
terraform init
terraform plan -var-file=production.tfvars -out=plan.tfplan
terraform apply plan.tfplan
```

Validation checkpoint:

  • Workspace URL is accessible via Private Link
  • SSO login works
  • A test cluster can start and run a simple notebook

4. Phase 2: Data Layer (Weeks 7–10)

4.1 Unity Catalog Configuration

Deploy multi-cloud Unity Catalog using terraform/shared/unity-catalog-multicloud.tf:

  1. Create AWS-side metastore (if in a new region) or link to existing metastore
  2. Create external locations pointing to S3 buckets
  3. Configure storage credentials for cross-cloud access
  4. Create shared catalogs for data that must be accessible from both clouds
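
The external locations in step 2 amount to mapping logical names to S3 URLs plus a storage credential. A minimal sketch of assembling those specs is below; in practice they would be created via Terraform (terraform/shared/unity-catalog-multicloud.tf) or the Databricks SDK, and the bucket and credential names here are hypothetical.

```python
# Hedged sketch: build external-location specs for the S3 buckets that back
# shared catalogs. Bucket and credential names are placeholders.

def external_location_specs(buckets: dict, credential_name: str) -> list:
    """Map logical location names to S3 URLs plus a storage credential."""
    return [
        {"name": name, "url": f"s3://{bucket}/", "credential_name": credential_name}
        for name, bucket in buckets.items()
    ]


specs = external_location_specs(
    buckets={"bronze_aws": "acme-lakehouse-bronze",
             "shared_ref": "acme-lakehouse-ref"},
    credential_name="aws_uc_storage_credential",
)
```

Generating the specs in one place keeps the Terraform variables and any ad hoc SDK calls consistent.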

4.2 Data Replication Strategy

For data that must exist on both clouds, choose a replication pattern:

| Pattern | Use Case | Latency | Complexity |
| --- | --- | --- | --- |
| Delta Sharing | Read-only cross-cloud access | Real-time (query-time) | Low |
| Deep Clone | Full copy for DR or local performance | Batch (scheduled) | Medium |
| DLT with CDF | Near-real-time sync of changed data | Minutes | High |
| Custom Spark job | Complex transformation during replication | Configurable | High |

4.3 Delta Sharing Setup

For read-only cross-cloud access (most common):

```sql
-- On Azure (provider)
CREATE SHARE analytics_share;
ALTER SHARE analytics_share ADD TABLE catalog.analytics.sales_summary;
ALTER SHARE analytics_share ADD TABLE catalog.analytics.product_dimensions;

CREATE RECIPIENT aws_workspace
  USING ID '<aws_metastore_id>';

GRANT SELECT ON SHARE analytics_share TO RECIPIENT aws_workspace;

-- On AWS (consumer) — access via Unity Catalog
-- Tables appear automatically under the shared catalog
SELECT * FROM shared_catalog.analytics.sales_summary;
```

4.4 Deep Clone for DR

For data that must be physically present on AWS for disaster recovery:

```python
# Scheduled job running on the AWS workspace
tables_to_replicate = [
    "catalog.bronze.raw_transactions",
    "catalog.silver.cleaned_transactions",
    "catalog.gold.daily_aggregates",
]

for table in tables_to_replicate:
    # Replace only the leading catalog name, e.g.
    # catalog.bronze.raw_transactions -> dr_catalog.bronze.raw_transactions
    target_table = table.replace("catalog.", "dr_catalog.", 1)
    spark.sql(f"""
        CREATE OR REPLACE TABLE {target_table}
        DEEP CLONE {table}
    """)
```

4.5 Data Validation

After replication, validate data integrity:

```python
def validate_replication(source_table: str, target_table: str) -> bool:
    """Compare row counts and per-row checksums between source and replica."""
    source_count = spark.sql(f"SELECT COUNT(*) AS cnt FROM {source_table}").first().cnt
    target_count = spark.sql(f"SELECT COUNT(*) AS cnt FROM {target_table}").first().cnt

    # Per-row hash; column order must match between the two tables.
    # Note: concat_ws skips NULLs, so add an explicit null sentinel
    # if columns can be NULL.
    source_hashes = spark.sql(
        f"SELECT md5(concat_ws('|', *)) AS row_hash FROM {source_table}"
    )
    target_hashes = spark.sql(
        f"SELECT md5(concat_ws('|', *)) AS row_hash FROM {target_table}"
    )

    count_match = source_count == target_count
    # exceptAll respects duplicates: an empty result plus equal counts
    # means the two tables contain identical row multisets.
    hash_match = source_hashes.exceptAll(target_hashes).count() == 0

    return count_match and hash_match
```

5. Phase 3: Workload Migration (Weeks 11–14)

5.1 Migration Order

Migrate workloads in order of increasing risk:

  1. Read-only analytics — SQL dashboards, reporting queries
  2. Batch jobs — scheduled ETL that can tolerate brief downtime
  3. ML training — model training workloads (not serving)
  4. Streaming — real-time ingestion and processing (highest risk)

5.2 Job Migration Checklist

For each job being migrated to AWS:

  • [ ] Verify all source data is accessible from AWS workspace
  • [ ] Update storage paths from abfss:// to s3:// (or use Unity Catalog table names)
  • [ ] Update cluster configuration for AWS instance types
  • [ ] Translate Azure Key Vault secret references to AWS Secrets Manager
  • [ ] Update any hardcoded Azure resource references
  • [ ] Run job in parallel on both clouds and compare outputs
  • [ ] Validate output data matches Azure version
  • [ ] Update downstream consumers to read from AWS output
  • [ ] Decommission Azure version (or keep for DR)
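
The parallel-run and validation items above boil down to comparing summary metrics from both clouds. A minimal sketch, assuming each cloud's job harness emits a dict of metrics (the metric names here are illustrative):

```python
# Compare output metrics (row counts, column checksums, etc.) collected from
# the same job run on Azure and on AWS. Metric names are placeholders.

def compare_parallel_runs(azure_metrics: dict, aws_metrics: dict) -> list:
    """Return human-readable mismatches; an empty list means outputs agree."""
    mismatches = []
    for key in sorted(set(azure_metrics) | set(aws_metrics)):
        a, b = azure_metrics.get(key), aws_metrics.get(key)
        if a != b:
            mismatches.append(f"{key}: azure={a!r} aws={b!r}")
    return mismatches


diff = compare_parallel_runs(
    {"row_count": 1_204_332, "amount_sum": 98_114.25},
    {"row_count": 1_204_332, "amount_sum": 98_114.25},
)
```

An empty diff clears the output-validation checklist items; any mismatch should block decommissioning the Azure version.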

5.3 Instance Type Mapping

| Azure VM Type | AWS Instance Type | vCPUs | Memory (GB) | Notes |
| --- | --- | --- | --- | --- |
| Standard_DS3_v2 | m5.xlarge | 4 | 16 | General purpose |
| Standard_DS4_v2 | m5.2xlarge | 8 | 32 | General purpose |
| Standard_DS5_v2 | m5.4xlarge | 16 | 64 | General purpose |
| Standard_E8s_v3 | r5.2xlarge | 8 | 64 | Memory optimized |
| Standard_E16s_v3 | r5.4xlarge | 16 | 128 | Memory optimized |
| Standard_NC6s_v3 | p3.2xlarge | 6 / 8 | 112 / 61 | GPU (ML training); Azure / AWS values shown |
| Standard_L8s_v2 | i3.2xlarge | 8 | 61 | Storage optimized |
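
Applying the mapping table to a job's cluster configuration can be scripted. This is a minimal sketch; the spec shape is a simplified stand-in for the Jobs API's cluster block, not its exact schema.

```python
# Node-type mapping mirroring the table above.
AZURE_TO_AWS_NODE = {
    "Standard_DS3_v2": "m5.xlarge",
    "Standard_DS4_v2": "m5.2xlarge",
    "Standard_DS5_v2": "m5.4xlarge",
    "Standard_E8s_v3": "r5.2xlarge",
    "Standard_E16s_v3": "r5.4xlarge",
    "Standard_NC6s_v3": "p3.2xlarge",
    "Standard_L8s_v2": "i3.2xlarge",
}


def translate_cluster_spec(spec: dict) -> dict:
    """Return a copy of the cluster spec with the AWS node type substituted."""
    node = spec["node_type_id"]
    if node not in AZURE_TO_AWS_NODE:
        raise ValueError(f"No AWS mapping for {node}; add it before migrating")
    return {**spec, "node_type_id": AZURE_TO_AWS_NODE[node]}


aws_spec = translate_cluster_spec(
    {"node_type_id": "Standard_E8s_v3", "num_workers": 4}
)
```

Failing loudly on an unmapped node type is deliberate: a silent fallback would let an unreviewed instance choice slip into production.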

5.4 CI/CD Pipeline Update

Deploy the unified CI/CD pipeline from cicd/multi-cloud-pipeline.yml:

  1. Configure service connections for both Azure and AWS
  2. Update pipeline variables with both cloud workspace URLs
  3. Run pipeline in dry-run mode against both clouds
  4. Enable full deployment

6. Phase 4: Operations (Ongoing)

6.1 Monitoring Setup

  • Deploy centralized monitoring per ADR-012
  • Configure alerts for cross-cloud replication lag
  • Set up cost anomaly detection for both clouds
  • Enable Databricks SQL warehouse query monitoring

6.2 DR Drill Schedule

| Drill | Frequency | Duration | Scope |
| --- | --- | --- | --- |
| Connectivity test | Weekly (automated) | 5 minutes | VPN tunnel health, DNS resolution |
| Data validation | Daily (automated) | 30 minutes | Row count and checksum comparison |
| Failover test | Quarterly (manual) | 4 hours | Full failover to AWS, run critical jobs |
| Full DR exercise | Annually | 1 day | Complete failover, operate on AWS for 24h |
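
The weekly connectivity drill can be automated with a DNS resolution check plus a TCP reachability probe. A sketch using only the standard library; the hosts and ports would be your actual cross-cloud endpoints, not the placeholders shown.

```python
# Automated connectivity drill: DNS resolution + TCP reachability.
import socket


def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_dns(hostname: str) -> bool:
    """Return True if the hostname resolves (cross-cloud forwarders working)."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False
```

Wiring the two checks into a scheduled job that alerts on failure covers the "VPN tunnel health, DNS resolution" scope of the weekly drill.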

6.3 Cost Optimization Cadence

| Review | Frequency | Actions |
| --- | --- | --- |
| Spot/reserved instance analysis | Monthly | Adjust reserved capacity commitments |
| Cross-cloud egress review | Monthly | Identify and reduce unnecessary data transfer |
| Unused resource cleanup | Weekly (automated) | Terminate idle clusters, delete orphaned resources |
| DBU rate comparison | Quarterly | Rebalance workloads between clouds for cost |
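
The weekly unused-resource cleanup reduces to selecting running clusters idle beyond a cutoff. In practice the listing would come from the Databricks SDK and termination would go through the Clusters API; the input shape below is a simplified stand-in.

```python
# Select idle clusters for termination. Cluster dicts are simplified
# stand-ins for real workspace listings.
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def idle_clusters(clusters: List[dict], max_idle: timedelta,
                  now: Optional[datetime] = None) -> List[str]:
    """Return IDs of running clusters whose last activity exceeds max_idle."""
    now = now or datetime.now(timezone.utc)
    return [
        c["cluster_id"]
        for c in clusters
        if c["state"] == "RUNNING" and now - c["last_activity"] > max_idle
    ]


ref = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
stale = idle_clusters(
    [
        {"cluster_id": "a1", "state": "RUNNING",
         "last_activity": ref - timedelta(hours=5)},
        {"cluster_id": "b2", "state": "RUNNING",
         "last_activity": ref - timedelta(minutes=10)},
    ],
    max_idle=timedelta(hours=2),
    now=ref,
)
```

Running the selector in report-only mode for a few weeks before enabling automatic termination avoids surprising interactive users.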

6.4 Rollback Plan

If multi-cloud operations prove unsustainable:

  1. Stop all AWS-originated write workloads
  2. Ensure all data is replicated back to Azure
  3. Update downstream consumers to point to Azure
  4. Decommission AWS workspace and infrastructure
  5. Retain Terraform state and configuration for potential future re-deployment
  6. Update Unity Catalog to remove AWS external locations

Estimated rollback duration: 2–4 weeks for a controlled unwinding.


7. Common Migration Issues

| Issue | Symptom | Resolution |
| --- | --- | --- |
| SCIM sync delay | Users can't log into AWS workspace | Check SCIM provisioning logs in Azure AD; manual sync if needed |
| DNS resolution failure | Private endpoint connections time out | Verify cross-cloud DNS forwarders; check Route 53 resolver rules |
| VPN tunnel flapping | Intermittent connectivity | Check BGP timers; ensure DPD settings match on both sides |
| Delta Sharing latency | Cross-cloud queries slow | Verify VPN bandwidth; consider Deep Clone for high-frequency reads |
| Cluster startup failure | "Insufficient capacity" on AWS | Change to a different instance type or availability zone |
| Secret access errors | Jobs fail with 403 on AWS Secrets Manager | Verify IAM role trust policy includes Databricks service principal |



This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete Multi-Cloud Lakehouse Blueprint with all files, templates, and documentation for $69.

Get the Full Kit →

Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.

Get the Complete Bundle →


