
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Multi-Cloud Lakehouse Blueprint: Migration Guide — Azure-Only to Multi-Cloud Lakehouse

Datanest Digital (datanest.dev)

Overview

This guide provides a structured approach for organizations currently running Databricks
exclusively on Azure to extend their lakehouse to include AWS. It covers assessment,
planning, phased execution, validation, and cutover.


1. Migration Phases

```
Phase 0          Phase 1           Phase 2           Phase 3          Phase 4
Assessment  ──►  Foundation   ──►  Data Layer   ──►  Workloads   ──►  Operations
(2 weeks)        (4 weeks)         (4 weeks)         (4 weeks)        (Ongoing)

- Inventory      - AWS account     - Unity Catalog   - Migrate        - Monitoring
- Dependencies   - Networking      - Delta Sharing     selected       - DR drills
- Cost model     - IAM federation  - Replication       workloads      - Cost review
- Go/no-go       - Terraform       - Validation      - CI/CD          - Optimization
```

2. Phase 0: Assessment (Weeks 1–2)

2.1 Workload Inventory

Catalog all existing Azure Databricks workloads:

| Category | What to Document |
| --- | --- |
| Workspaces | Names, regions, SKU (Premium/Standard), VNet injection status |
| Clusters | Types (all-purpose, job, SQL warehouse), instance types, autoscaling configs |
| Jobs | Scheduled jobs, DLT pipelines, streaming jobs, dependencies |
| Data assets | Unity Catalog catalogs, schemas, tables, volumes, external locations |
| Users & groups | Count of users, group structure, admin accounts |
| Networking | VNet configuration, private endpoints, firewall rules |
| Secrets | Key Vault integrations, secret scopes |
| External integrations | Power BI, ADF, Synapse, Event Hubs connections |
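
In practice the cluster and job listings would come from the Databricks SDK's `WorkspaceClient` (e.g. `w.clusters.list()` and `w.jobs.list()`); the sketch below normalizes such listings into flat inventory rows, using simplified stand-in dicts rather than the SDK's actual response types.

```python
# Sketch of an inventory collector. The input shapes are simplified
# stand-ins for workspace listings, not the SDK's real response objects.
from typing import Dict, List


def build_inventory(clusters: List[dict], jobs: List[dict]) -> List[Dict[str, str]]:
    """Normalize cluster and job listings into flat inventory rows."""
    rows = []
    for c in clusters:
        rows.append({
            "category": "Clusters",
            "name": c["cluster_name"],
            "detail": f"{c['node_type_id']}, autoscale={c.get('autoscale', False)}",
        })
    for j in jobs:
        rows.append({
            "category": "Jobs",
            "name": j["name"],
            "detail": f"schedule={j.get('schedule', 'none')}",
        })
    return rows


inventory = build_inventory(
    clusters=[{"cluster_name": "etl-prod", "node_type_id": "Standard_DS4_v2",
               "autoscale": True}],
    jobs=[{"name": "nightly-aggregates", "schedule": "0 2 * * *"}],
)
```

Exporting these rows to a spreadsheet or Delta table gives stakeholders a single artifact to review during the go/no-go discussion.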

2.2 Workload Classification

Classify each workload for migration disposition:

| Disposition | Criteria | Example |
| --- | --- | --- |
| Stay on Azure | Tightly coupled to Azure services (ADF, Synapse); latency-sensitive to Azure storage | Real-time ingestion from Event Hubs |
| Extend to AWS | Benefits from multi-cloud (DR, cost optimization, regulatory) | Batch analytics, ML training |
| Migrate to AWS | Better suited for AWS (proximity to AWS data sources, team expertise) | S3-native data sources |
| Shared | Must be accessible from both clouds | Reference data, dimension tables |

2.3 Cost Modeling

Use tools/cost_comparison_model.py to estimate:

  • AWS infrastructure costs for target workloads
  • Cross-cloud networking costs (VPN, data transfer)
  • Incremental Unity Catalog licensing
  • Operational overhead (estimated person-hours for multi-cloud management)
  • Total cost of ownership comparison: Azure-only vs. multi-cloud
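
The comparison reduces to summing each scenario's infrastructure, data-transfer, and operational costs. This sketch is loosely in the spirit of tools/cost_comparison_model.py, but every figure below is a placeholder assumption, not real pricing.

```python
# Illustrative monthly TCO comparison; all dollar amounts, egress volumes,
# and hourly rates are made-up assumptions for demonstration.

def monthly_tco(compute: float, storage: float, egress_gb: float,
                egress_rate: float, ops_hours: float, hourly_rate: float) -> float:
    """Sum infrastructure, data-transfer, and operational cost per month."""
    return compute + storage + egress_gb * egress_rate + ops_hours * hourly_rate


azure_only = monthly_tco(compute=42_000, storage=6_000, egress_gb=0,
                         egress_rate=0.0, ops_hours=40, hourly_rate=95)
multi_cloud = monthly_tco(compute=35_000, storage=7_500, egress_gb=8_000,
                          egress_rate=0.09, ops_hours=70, hourly_rate=95)

# A positive delta argues for the multi-cloud extension on cost grounds alone;
# a negative one means the case must rest on DR, regulatory, or strategic drivers.
monthly_savings = azure_only - multi_cloud
```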

2.4 Go/No-Go Decision

Present findings to stakeholders. A multi-cloud extension is justified when:

  • Regulatory requirements mandate AWS-region data residency
  • DR requirements exceed what Azure-only can provide
  • Cost savings from AWS compute exceed the multi-cloud operational premium
  • Organizational acquisitions bring AWS-native workloads
  • Strategic vendor diversification is a board-level priority

3. Phase 1: Foundation (Weeks 3–6)

3.1 AWS Account Setup

  1. Create a dedicated AWS account (or use an existing one)
  2. Configure AWS Organizations with appropriate SCPs
  3. Enable CloudTrail for audit logging
  4. Configure AWS Config for compliance monitoring

3.2 Network Foundation

Deploy cross-cloud networking using terraform/aws/main.tf and the network
architecture in docs/network_architecture.md:

  1. Deploy AWS VPC with required subnets
  2. Deploy AWS Transit Gateway
  3. Configure site-to-site VPN between Azure VPN Gateway and AWS VGW
  4. Verify BGP route propagation
  5. Test connectivity from Azure Spoke VNet to AWS Workload VPC

Validation checkpoint:

```bash
# From an Azure VM in the spoke VNet
ping 10.1.16.10  # AWS workload VPC host

# From an AWS EC2 instance in the workload VPC
ping 10.0.16.10  # Azure spoke VNet host
```

3.3 Identity Federation

Configure Azure AD → AWS IAM federation per docs/identity_federation.md:

  1. Create enterprise application for AWS Databricks
  2. Configure SAML SSO
  3. Configure SCIM provisioning
  4. Test SSO login to AWS Databricks account console
  5. Verify user and group sync

Validation checkpoint:

  • User can SSO into Azure Databricks (existing)
  • Same user can SSO into AWS Databricks (new)
  • Groups appear correctly in both workspace user lists

3.4 AWS Databricks Workspace Deployment

Deploy the AWS workspace using terraform/aws/main.tf:

```bash
cd terraform/aws
terraform init
terraform plan -var-file=production.tfvars -out=plan.tfplan
terraform apply plan.tfplan
```

Validation checkpoint:

  • Workspace URL is accessible via Private Link
  • SSO login works
  • A test cluster can start and run a simple notebook

4. Phase 2: Data Layer (Weeks 7–10)

4.1 Unity Catalog Configuration

Deploy multi-cloud Unity Catalog using terraform/shared/unity-catalog-multicloud.tf:

  1. Create AWS-side metastore (if in a new region) or link to existing metastore
  2. Create external locations pointing to S3 buckets
  3. Configure storage credentials for cross-cloud access
  4. Create shared catalogs for data that must be accessible from both clouds
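
The external locations in step 2 amount to mapping logical names to S3 URLs plus a storage credential. A minimal sketch of assembling those specs is below; in practice they would be created via Terraform (terraform/shared/unity-catalog-multicloud.tf) or the Databricks SDK, and the bucket and credential names here are hypothetical.

```python
# Hedged sketch: build external-location specs for the S3 buckets that back
# shared catalogs. Bucket and credential names are placeholders.

def external_location_specs(buckets: dict, credential_name: str) -> list:
    """Map logical location names to S3 URLs plus a storage credential."""
    return [
        {"name": name, "url": f"s3://{bucket}/", "credential_name": credential_name}
        for name, bucket in buckets.items()
    ]


specs = external_location_specs(
    buckets={"bronze_aws": "acme-lakehouse-bronze",
             "shared_ref": "acme-lakehouse-ref"},
    credential_name="aws_uc_storage_credential",
)
```

Generating the specs in one place keeps the Terraform variables and any ad hoc SDK calls consistent.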

4.2 Data Replication Strategy

For data that must exist on both clouds, choose a replication pattern:

| Pattern | Use Case | Latency | Complexity |
| --- | --- | --- | --- |
| Delta Sharing | Read-only cross-cloud access | Real-time (query-time) | Low |
| Deep Clone | Full copy for DR or local performance | Batch (scheduled) | Medium |
| DLT with CDF | Near-real-time sync of changed data | Minutes | High |
| Custom Spark job | Complex transformation during replication | Configurable | High |

4.3 Delta Sharing Setup

For read-only cross-cloud access (most common):

```sql
-- On Azure (provider)
CREATE SHARE analytics_share;
ALTER SHARE analytics_share ADD TABLE catalog.analytics.sales_summary;
ALTER SHARE analytics_share ADD TABLE catalog.analytics.product_dimensions;

CREATE RECIPIENT aws_workspace
  USING ID '<aws_metastore_id>';

GRANT SELECT ON SHARE analytics_share TO RECIPIENT aws_workspace;

-- On AWS (consumer) — access via Unity Catalog
-- Tables appear automatically under the shared catalog
SELECT * FROM shared_catalog.analytics.sales_summary;
```

4.4 Deep Clone for DR

For data that must be physically present on AWS for disaster recovery:

```python
# Scheduled job running on the AWS workspace
tables_to_replicate = [
    "catalog.bronze.raw_transactions",
    "catalog.silver.cleaned_transactions",
    "catalog.gold.daily_aggregates",
]

for table in tables_to_replicate:
    # Replace only the leading catalog name, e.g.
    # catalog.bronze.raw_transactions -> dr_catalog.bronze.raw_transactions
    target_table = table.replace("catalog.", "dr_catalog.", 1)
    spark.sql(f"""
        CREATE OR REPLACE TABLE {target_table}
        DEEP CLONE {table}
    """)
```

4.5 Data Validation

After replication, validate data integrity:

```python
def validate_replication(source_table: str, target_table: str) -> bool:
    """Compare row counts and per-row checksums between source and replica."""
    source_count = spark.sql(f"SELECT COUNT(*) AS cnt FROM {source_table}").first().cnt
    target_count = spark.sql(f"SELECT COUNT(*) AS cnt FROM {target_table}").first().cnt

    # Per-row hash; column order must match between the two tables.
    # Note: concat_ws skips NULLs, so add an explicit null sentinel
    # if columns can be NULL.
    source_hashes = spark.sql(
        f"SELECT md5(concat_ws('|', *)) AS row_hash FROM {source_table}"
    )
    target_hashes = spark.sql(
        f"SELECT md5(concat_ws('|', *)) AS row_hash FROM {target_table}"
    )

    count_match = source_count == target_count
    # exceptAll respects duplicates: an empty result plus equal counts
    # means the two tables contain identical row multisets.
    hash_match = source_hashes.exceptAll(target_hashes).count() == 0

    return count_match and hash_match
```

5. Phase 3: Workload Migration (Weeks 11–14)

5.1 Migration Order

Migrate workloads in order of increasing risk:

  1. Read-only analytics — SQL dashboards, reporting queries
  2. Batch jobs — scheduled ETL that can tolerate brief downtime
  3. ML training — model training workloads (not serving)
  4. Streaming — real-time ingestion and processing (highest risk)

5.2 Job Migration Checklist

For each job being migrated to AWS:

  • [ ] Verify all source data is accessible from AWS workspace
  • [ ] Update storage paths from abfss:// to s3:// (or use Unity Catalog table names)
  • [ ] Update cluster configuration for AWS instance types
  • [ ] Translate Azure Key Vault secret references to AWS Secrets Manager
  • [ ] Update any hardcoded Azure resource references
  • [ ] Run job in parallel on both clouds and compare outputs
  • [ ] Validate output data matches Azure version
  • [ ] Update downstream consumers to read from AWS output
  • [ ] Decommission Azure version (or keep for DR)
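
The parallel-run and validation items above boil down to comparing summary metrics from both clouds. A minimal sketch, assuming each cloud's job harness emits a dict of metrics (the metric names here are illustrative):

```python
# Compare output metrics (row counts, column checksums, etc.) collected from
# the same job run on Azure and on AWS. Metric names are placeholders.

def compare_parallel_runs(azure_metrics: dict, aws_metrics: dict) -> list:
    """Return human-readable mismatches; an empty list means outputs agree."""
    mismatches = []
    for key in sorted(set(azure_metrics) | set(aws_metrics)):
        a, b = azure_metrics.get(key), aws_metrics.get(key)
        if a != b:
            mismatches.append(f"{key}: azure={a!r} aws={b!r}")
    return mismatches


diff = compare_parallel_runs(
    {"row_count": 1_204_332, "amount_sum": 98_114.25},
    {"row_count": 1_204_332, "amount_sum": 98_114.25},
)
```

An empty diff clears the output-validation checklist items; any mismatch should block decommissioning the Azure version.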

5.3 Instance Type Mapping

| Azure VM Type | AWS Instance Type | vCPUs | Memory (GB) | Notes |
| --- | --- | --- | --- | --- |
| Standard_DS3_v2 | m5.xlarge | 4 | 16 | General purpose |
| Standard_DS4_v2 | m5.2xlarge | 8 | 32 | General purpose |
| Standard_DS5_v2 | m5.4xlarge | 16 | 64 | General purpose |
| Standard_E8s_v3 | r5.2xlarge | 8 | 64 | Memory optimized |
| Standard_E16s_v3 | r5.4xlarge | 16 | 128 | Memory optimized |
| Standard_NC6s_v3 | p3.2xlarge | 6 / 8 | 112 / 61 | GPU (ML training); Azure / AWS values shown |
| Standard_L8s_v2 | i3.2xlarge | 8 | 61 | Storage optimized |
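
Applying the mapping table to a job's cluster configuration can be scripted. This is a minimal sketch; the spec shape is a simplified stand-in for the Jobs API's cluster block, not its exact schema.

```python
# Node-type mapping mirroring the table above.
AZURE_TO_AWS_NODE = {
    "Standard_DS3_v2": "m5.xlarge",
    "Standard_DS4_v2": "m5.2xlarge",
    "Standard_DS5_v2": "m5.4xlarge",
    "Standard_E8s_v3": "r5.2xlarge",
    "Standard_E16s_v3": "r5.4xlarge",
    "Standard_NC6s_v3": "p3.2xlarge",
    "Standard_L8s_v2": "i3.2xlarge",
}


def translate_cluster_spec(spec: dict) -> dict:
    """Return a copy of the cluster spec with the AWS node type substituted."""
    node = spec["node_type_id"]
    if node not in AZURE_TO_AWS_NODE:
        raise ValueError(f"No AWS mapping for {node}; add it before migrating")
    return {**spec, "node_type_id": AZURE_TO_AWS_NODE[node]}


aws_spec = translate_cluster_spec(
    {"node_type_id": "Standard_E8s_v3", "num_workers": 4}
)
```

Failing loudly on an unmapped node type is deliberate: a silent fallback would let an unreviewed instance choice slip into production.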

5.4 CI/CD Pipeline Update

Deploy the unified CI/CD pipeline from cicd/multi-cloud-pipeline.yml:

  1. Configure service connections for both Azure and AWS
  2. Update pipeline variables with both cloud workspace URLs
  3. Run pipeline in dry-run mode against both clouds
  4. Enable full deployment

6. Phase 4: Operations (Ongoing)

6.1 Monitoring Setup

  • Deploy centralized monitoring per ADR-012
  • Configure alerts for cross-cloud replication lag
  • Set up cost anomaly detection for both clouds
  • Enable Databricks SQL warehouse query monitoring

6.2 DR Drill Schedule

| Drill | Frequency | Duration | Scope |
| --- | --- | --- | --- |
| Connectivity test | Weekly (automated) | 5 minutes | VPN tunnel health, DNS resolution |
| Data validation | Daily (automated) | 30 minutes | Row count and checksum comparison |
| Failover test | Quarterly (manual) | 4 hours | Full failover to AWS, run critical jobs |
| Full DR exercise | Annually | 1 day | Complete failover, operate on AWS for 24h |
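
The weekly connectivity drill can be automated with a DNS resolution check plus a TCP reachability probe. A sketch using only the standard library; the hosts and ports would be your actual cross-cloud endpoints, not the placeholders shown.

```python
# Automated connectivity drill: DNS resolution + TCP reachability.
import socket


def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_dns(hostname: str) -> bool:
    """Return True if the hostname resolves (cross-cloud forwarders working)."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False
```

Wiring the two checks into a scheduled job that alerts on failure covers the "VPN tunnel health, DNS resolution" scope of the weekly drill.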

6.3 Cost Optimization Cadence

| Review | Frequency | Actions |
| --- | --- | --- |
| Spot/reserved instance analysis | Monthly | Adjust reserved capacity commitments |
| Cross-cloud egress review | Monthly | Identify and reduce unnecessary data transfer |
| Unused resource cleanup | Weekly (automated) | Terminate idle clusters, delete orphaned resources |
| DBU rate comparison | Quarterly | Rebalance workloads between clouds for cost |
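
The weekly unused-resource cleanup reduces to selecting running clusters idle beyond a cutoff. In practice the listing would come from the Databricks SDK and termination would go through the Clusters API; the input shape below is a simplified stand-in.

```python
# Select idle clusters for termination. Cluster dicts are simplified
# stand-ins for real workspace listings.
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def idle_clusters(clusters: List[dict], max_idle: timedelta,
                  now: Optional[datetime] = None) -> List[str]:
    """Return IDs of running clusters whose last activity exceeds max_idle."""
    now = now or datetime.now(timezone.utc)
    return [
        c["cluster_id"]
        for c in clusters
        if c["state"] == "RUNNING" and now - c["last_activity"] > max_idle
    ]


ref = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
stale = idle_clusters(
    [
        {"cluster_id": "a1", "state": "RUNNING",
         "last_activity": ref - timedelta(hours=5)},
        {"cluster_id": "b2", "state": "RUNNING",
         "last_activity": ref - timedelta(minutes=10)},
    ],
    max_idle=timedelta(hours=2),
    now=ref,
)
```

Running the selector in report-only mode for a few weeks before enabling automatic termination avoids surprising interactive users.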

6.4 Rollback Plan

If multi-cloud operations prove unsustainable:

  1. Stop all AWS-originated write workloads
  2. Ensure all data is replicated back to Azure
  3. Update downstream consumers to point to Azure
  4. Decommission AWS workspace and infrastructure
  5. Retain Terraform state and configuration for potential future re-deployment
  6. Update Unity Catalog to remove AWS external locations

Estimated rollback duration: 2–4 weeks for a controlled unwinding.


7. Common Migration Issues

| Issue | Symptom | Resolution |
| --- | --- | --- |
| SCIM sync delay | Users can't log into AWS workspace | Check SCIM provisioning logs in Azure AD; manual sync if needed |
| DNS resolution failure | Private endpoint connections time out | Verify cross-cloud DNS forwarders; check Route 53 resolver rules |
| VPN tunnel flapping | Intermittent connectivity | Check BGP timers; ensure DPD settings match on both sides |
| Delta Sharing latency | Cross-cloud queries slow | Verify VPN bandwidth; consider Deep Clone for high-frequency reads |
| Cluster startup failure | "Insufficient capacity" on AWS | Change to a different instance type or availability zone |
| Secret access errors | Jobs fail with 403 on AWS Secrets Manager | Verify IAM role trust policy includes Databricks service principal |



This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete Multi-Cloud Lakehouse Blueprint with all files, templates, and documentation for $69.

Get the Full Kit →

Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.

Get the Complete Bundle →


