Migration Guide — Azure-Only to Multi-Cloud Lakehouse
Datanest Digital — datanest.dev
Overview
This guide provides a structured approach for organizations currently running Databricks
exclusively on Azure to extend their lakehouse to include AWS. It covers assessment,
planning, phased execution, validation, and cutover.
1. Migration Phases
| Phase 0: Assessment (2 weeks) | Phase 1: Foundation (4 weeks) | Phase 2: Data Layer (4 weeks) | Phase 3: Workloads (4 weeks) | Phase 4: Operations (Ongoing) |
|---|---|---|---|---|
| Inventory | AWS account | Unity Catalog | Migrate selected workloads | Monitoring |
| Dependencies | Networking | Delta Sharing | CI/CD | DR drills |
| Cost model | IAM federation | Replication | | Cost review |
| Go/no-go | Terraform | Validation | | Optimization |
2. Phase 0: Assessment (Weeks 1–2)
2.1 Workload Inventory
Catalog all existing Azure Databricks workloads:
| Category | What to Document |
|---|---|
| Workspaces | Names, regions, SKU (Premium/Standard), VNet injection status |
| Clusters | Types (all-purpose, job, SQL warehouse), instance types, autoscaling configs |
| Jobs | Scheduled jobs, DLT pipelines, streaming jobs, dependencies |
| Data assets | Unity Catalog catalogs, schemas, tables, volumes, external locations |
| Users & groups | Count of users, group structure, admin accounts |
| Networking | VNet configuration, private endpoints, firewall rules |
| Secrets | Key Vault integrations, secret scopes |
| External integrations | Power BI, ADF, Synapse, Event Hubs connections |
2.2 Workload Classification
Classify each workload for migration disposition:
| Disposition | Criteria | Example |
|---|---|---|
| Stay on Azure | Tightly coupled to Azure services (ADF, Synapse), latency-sensitive to Azure storage | Real-time ingestion from Event Hubs |
| Extend to AWS | Benefits from multi-cloud (DR, cost optimization, regulatory) | Batch analytics, ML training |
| Migrate to AWS | Better suited for AWS (proximity to AWS data sources, team expertise) | S3-native data sources |
| Shared | Must be accessible from both clouds | Reference data, dimension tables |
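The classification rules above can be sketched as a simple decision function. The attribute names (`coupled_to_azure`, `needs_multicloud`, and so on) are illustrative placeholders, not part of any Databricks API, and the precedence order is one reasonable choice among several:

```python
def classify_workload(coupled_to_azure: bool, needs_multicloud: bool,
                      aws_native_sources: bool, shared_reference_data: bool) -> str:
    """Assign a migration disposition using the criteria from the table above."""
    if shared_reference_data:
        return "Shared"           # must be reachable from both clouds
    if coupled_to_azure:
        return "Stay on Azure"    # e.g. tight ADF/Synapse/Event Hubs coupling
    if aws_native_sources:
        return "Migrate to AWS"   # e.g. S3-native data sources
    if needs_multicloud:
        return "Extend to AWS"    # DR, cost, or regulatory drivers
    return "Stay on Azure"        # default: avoid unnecessary migration
```

Encoding the rules this way makes the Phase 0 inventory machine-checkable: run the function over the workload inventory and review only the rows where the suggested disposition surprises you.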
2.3 Cost Modeling
Use tools/cost_comparison_model.py to estimate:
- AWS infrastructure costs for target workloads
- Cross-cloud networking costs (VPN, data transfer)
- Incremental Unity Catalog licensing
- Operational overhead (estimated person-hours for multi-cloud management)
- Total cost of ownership comparison: Azure-only vs. multi-cloud
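The shape of the TCO comparison can be sketched as follows; every dollar figure and the hourly rate below are placeholder assumptions for illustration, not benchmarks, and should be replaced with outputs from tools/cost_comparison_model.py:

```python
def monthly_tco(compute: float, storage: float, egress: float,
                ops_hours: float, hourly_rate: float = 120.0) -> float:
    """Monthly total cost of ownership: infrastructure plus operational labor."""
    return compute + storage + egress + ops_hours * hourly_rate

# Placeholder inputs — substitute your own estimates
azure_only = monthly_tco(compute=42_000, storage=6_000, egress=500, ops_hours=40)
multi_cloud = monthly_tco(compute=35_000, storage=7_500, egress=3_200, ops_hours=80)

# Multi-cloud is justified on cost grounds only when compute savings
# exceed the added egress and operational premium
print(f"Azure-only: ${azure_only:,.0f}/mo, multi-cloud: ${multi_cloud:,.0f}/mo")
```

Note that in this illustrative example the multi-cloud option costs more despite cheaper compute: cross-cloud egress and the doubled operational hours eat the savings, which is exactly the trade-off the go/no-go decision weighs.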
2.4 Go/No-Go Decision
Present findings to stakeholders. A multi-cloud extension is justified when:
- Regulatory requirements mandate AWS-region data residency
- DR requirements exceed what Azure-only can provide
- Cost savings from AWS compute exceed the multi-cloud operational premium
- Organizational acquisitions bring AWS-native workloads
- Strategic vendor diversification is a board-level priority
3. Phase 1: Foundation (Weeks 3–6)
3.1 AWS Account Setup
- Create a dedicated AWS account (or use an existing one)
- Configure AWS Organizations with appropriate SCPs
- Enable CloudTrail for audit logging
- Configure AWS Config for compliance monitoring
3.2 Network Foundation
Deploy cross-cloud networking using terraform/aws/main.tf and the network
architecture in docs/network_architecture.md:
- Deploy AWS VPC with required subnets
- Deploy AWS Transit Gateway
- Configure site-to-site VPN between Azure VPN Gateway and AWS VGW
- Verify BGP route propagation
- Test connectivity from Azure Spoke VNet to AWS Workload VPC
Validation checkpoint:
```
# From an Azure VM in the spoke VNet
ping 10.1.16.10   # AWS workload VPC host

# From an AWS EC2 instance in the workload VPC
ping 10.0.16.10   # Azure spoke VNet host
```
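ICMP alone can pass while application traffic is still blocked by a security group or NSG rule, so it helps to also probe a TCP port. A minimal sketch, assuming port 443 is the service port you care about (adjust to your environment):

```python
import socket

def tcp_reachable(host: str, port: int = 443, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the cross-cloud test hosts from the checkpoint above
for host in ("10.1.16.10", "10.0.16.10"):
    print(host, "reachable" if tcp_reachable(host) else "UNREACHABLE")
```

Run it from both sides of the tunnel; a pass on ping but a fail here usually points at a security group, NSG, or firewall rule rather than routing.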
3.3 Identity Federation
Configure Azure AD → AWS IAM federation per docs/identity_federation.md:
- Create enterprise application for AWS Databricks
- Configure SAML SSO
- Configure SCIM provisioning
- Test SSO login to AWS Databricks account console
- Verify user and group sync
Validation checkpoint:
- User can SSO into Azure Databricks (existing)
- Same user can SSO into AWS Databricks (new)
- Groups appear correctly in both workspace user lists
3.4 AWS Databricks Workspace Deployment
Deploy the AWS workspace using terraform/aws/main.tf:
```
cd terraform/aws
terraform init
terraform plan -var-file=production.tfvars -out=plan.tfplan
terraform apply plan.tfplan
```
Validation checkpoint:
- Workspace URL is accessible via Private Link
- SSO login works
- A test cluster can start and run a simple notebook
4. Phase 2: Data Layer (Weeks 7–10)
4.1 Unity Catalog Configuration
Deploy multi-cloud Unity Catalog using terraform/shared/unity-catalog-multicloud.tf:
- Create AWS-side metastore (if in a new region) or link to existing metastore
- Create external locations pointing to S3 buckets
- Configure storage credentials for cross-cloud access
- Create shared catalogs for data that must be accessible from both clouds
4.2 Data Replication Strategy
For data that must exist on both clouds, choose a replication pattern:
| Pattern | Use Case | Latency | Complexity |
|---|---|---|---|
| Delta Sharing | Read-only cross-cloud access | Real-time (query-time) | Low |
| Deep Clone | Full copy for DR or local performance | Batch (scheduled) | Medium |
| DLT with CDF | Near-real-time sync of changed data | Minutes | High |
| Custom Spark job | Complex transformation during replication | Configurable | High |
4.3 Delta Sharing Setup
For read-only cross-cloud access (most common):
```sql
-- On Azure (provider)
CREATE SHARE analytics_share;
ALTER SHARE analytics_share ADD TABLE catalog.analytics.sales_summary;
ALTER SHARE analytics_share ADD TABLE catalog.analytics.product_dimensions;

CREATE RECIPIENT aws_workspace
  USING ID '<aws_metastore_id>';

GRANT SELECT ON SHARE analytics_share TO RECIPIENT aws_workspace;

-- On AWS (consumer) — access via Unity Catalog;
-- tables appear automatically under the shared catalog
SELECT * FROM shared_catalog.analytics.sales_summary;
```
4.4 Deep Clone for DR
For data that must be physically present on AWS for disaster recovery:
```python
# Scheduled job running on the AWS workspace
tables_to_replicate = [
    "catalog.bronze.raw_transactions",
    "catalog.silver.cleaned_transactions",
    "catalog.gold.daily_aggregates",
]

for table in tables_to_replicate:
    # Replace only the leading catalog name when deriving the DR target
    target_table = table.replace("catalog.", "dr_catalog.", 1)
    # DEEP CLONE copies data and metadata; re-running it syncs incrementally
    spark.sql(f"""
        CREATE OR REPLACE TABLE {target_table}
        DEEP CLONE {table}
    """)
```
4.5 Data Validation
After replication, validate data integrity:
```python
def validate_replication(source_table: str, target_table: str) -> bool:
    """Compare row counts and per-row checksums between source and replica."""
    source_count = spark.table(source_table).count()
    target_count = spark.table(target_table).count()

    # Per-row MD5 over all columns. exceptAll compares the results as
    # multisets, so no ORDER BY is needed. Note concat_ws skips NULLs,
    # which can mask column-shift differences in rare cases.
    source_checksum = spark.sql(
        f"SELECT md5(concat_ws('|', *)) AS hash FROM {source_table}"
    )
    target_checksum = spark.sql(
        f"SELECT md5(concat_ws('|', *)) AS hash FROM {target_table}"
    )

    count_match = source_count == target_count
    hash_match = source_checksum.exceptAll(target_checksum).count() == 0
    return count_match and hash_match
```
5. Phase 3: Workload Migration (Weeks 11–14)
5.1 Migration Order
Migrate workloads in order of increasing risk:
- Read-only analytics — SQL dashboards, reporting queries
- Batch jobs — scheduled ETL that can tolerate brief downtime
- ML training — model training workloads (not serving)
- Streaming — real-time ingestion and processing (highest risk)
5.2 Job Migration Checklist
For each job being migrated to AWS:
- [ ] Verify all source data is accessible from the AWS workspace
- [ ] Update storage paths from `abfss://` to `s3://` (or use Unity Catalog table names)
- [ ] Update cluster configuration for AWS instance types
- [ ] Translate Azure Key Vault secret references to AWS Secrets Manager
- [ ] Update any hardcoded Azure resource references
- [ ] Run the job in parallel on both clouds and compare outputs
- [ ] Validate output data matches the Azure version
- [ ] Update downstream consumers to read from AWS output
- [ ] Decommission the Azure version (or keep it for DR)
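The storage-path step in the checklist can be automated. A sketch of the rewrite, where the container-to-bucket mapping is hypothetical and must match your actual S3 layout:

```python
import re

# Hypothetical mapping from ADLS container names to S3 bucket names
CONTAINER_TO_BUCKET = {
    "bronze": "acme-lake-bronze",
    "silver": "acme-lake-silver",
}

def abfss_to_s3(path: str) -> str:
    """Rewrite abfss://<container>@<account>.dfs.core.windows.net/<key>
    to s3://<bucket>/<key>."""
    m = re.match(r"abfss://([^@]+)@[^/]+/(.*)", path)
    if not m:
        raise ValueError(f"Not an abfss:// path: {path}")
    container, key = m.groups()
    return f"s3://{CONTAINER_TO_BUCKET[container]}/{key}"

print(abfss_to_s3("abfss://bronze@acmelake.dfs.core.windows.net/raw/tx"))
# → s3://acme-lake-bronze/raw/tx
```

Failing loudly on unmapped containers (the `KeyError`) is deliberate: a silent fallback would let a job write to the wrong bucket. Jobs that reference Unity Catalog table names instead of paths skip this step entirely.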
5.3 Instance Type Mapping
| Azure VM Type | AWS Instance Type | vCPUs | Memory (GB) | Notes |
|---|---|---|---|---|
| Standard_DS3_v2 | m5.xlarge | 4 | 16 | General purpose |
| Standard_DS4_v2 | m5.2xlarge | 8 | 32 | General purpose |
| Standard_DS5_v2 | m5.4xlarge | 16 | 64 | General purpose |
| Standard_E8s_v3 | r5.2xlarge | 8 | 64 | Memory optimized |
| Standard_E16s_v3 | r5.4xlarge | 16 | 128 | Memory optimized |
| Standard_NC6s_v3 | p3.2xlarge | 6 / 8 | 112 / 61 | GPU (ML training); specs shown as Azure / AWS |
| Standard_L8s_v2 | i3.2xlarge | 8 | 61 | Storage optimized |
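For use in job-migration scripts, the table can be encoded directly. Only the rows above are included, so extend the dict for any other VM types in your inventory:

```python
# Azure VM type -> closest AWS instance type, per the mapping table above
INSTANCE_MAP = {
    "Standard_DS3_v2": "m5.xlarge",
    "Standard_DS4_v2": "m5.2xlarge",
    "Standard_DS5_v2": "m5.4xlarge",
    "Standard_E8s_v3": "r5.2xlarge",
    "Standard_E16s_v3": "r5.4xlarge",
    "Standard_NC6s_v3": "p3.2xlarge",
    "Standard_L8s_v2": "i3.2xlarge",
}

def map_instance(azure_type: str) -> str:
    """Return the AWS equivalent, failing loudly on unmapped types."""
    try:
        return INSTANCE_MAP[azure_type]
    except KeyError:
        raise ValueError(f"No AWS mapping defined for {azure_type}") from None
```

As with path translation, refusing to guess on an unmapped type is safer than a default: a silently substituted instance family can change job cost and runtime characteristics.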
5.4 CI/CD Pipeline Update
Deploy the unified CI/CD pipeline from cicd/multi-cloud-pipeline.yml:
- Configure service connections for both Azure and AWS
- Update pipeline variables with both cloud workspace URLs
- Run pipeline in dry-run mode against both clouds
- Enable full deployment
6. Phase 4: Operations (Ongoing)
6.1 Monitoring Setup
- Deploy centralized monitoring per ADR-012
- Configure alerts for cross-cloud replication lag
- Set up cost anomaly detection for both clouds
- Enable Databricks SQL warehouse query monitoring
6.2 DR Drill Schedule
| Drill | Frequency | Duration | Scope |
|---|---|---|---|
| Connectivity test | Weekly (automated) | 5 minutes | VPN tunnel health, DNS resolution |
| Data validation | Daily (automated) | 30 minutes | Row count and checksum comparison |
| Failover test | Quarterly (manual) | 4 hours | Full failover to AWS, run critical jobs |
| Full DR exercise | Annually | 1 day | Complete failover, operate on AWS for 24h |
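The daily data-validation drill can also track replication lag. A sketch comparing last-commit timestamps, with a hypothetical 24-hour threshold; on Databricks the commit times would come from `DESCRIBE HISTORY <table>`:

```python
from datetime import datetime, timedelta, timezone

def replication_lag(source_commit: datetime, target_commit: datetime) -> timedelta:
    """How far the DR copy trails the source, based on last Delta commit times."""
    return max(source_commit - target_commit, timedelta(0))

def lag_ok(source_commit: datetime, target_commit: datetime,
           threshold: timedelta = timedelta(hours=24)) -> bool:
    """True if the replica is within the allowed lag window."""
    return replication_lag(source_commit, target_commit) <= threshold

now = datetime.now(timezone.utc)
print(lag_ok(now, now - timedelta(hours=2)))   # → True
print(lag_ok(now, now - timedelta(hours=30)))  # → False
```

Wire the failing case into the alerting configured in 6.1 so a stalled Deep Clone job is caught by the drill rather than by a real failover.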
6.3 Cost Optimization Cadence
| Review | Frequency | Actions |
|---|---|---|
| Spot/reserved instance analysis | Monthly | Adjust reserved capacity commitments |
| Cross-cloud egress review | Monthly | Identify and reduce unnecessary data transfer |
| Unused resource cleanup | Weekly (automated) | Terminate idle clusters, delete orphaned resources |
| DBU rate comparison | Quarterly | Rebalance workloads between clouds for cost |
6.4 Rollback Plan
If multi-cloud operations prove unsustainable:
- Stop all AWS-originated write workloads
- Ensure all data is replicated back to Azure
- Update downstream consumers to point to Azure
- Decommission AWS workspace and infrastructure
- Retain Terraform state and configuration for potential future re-deployment
- Update Unity Catalog to remove AWS external locations
Estimated rollback duration: 2–4 weeks for a controlled unwinding.
7. Common Migration Issues
| Issue | Symptom | Resolution |
|---|---|---|
| SCIM sync delay | Users can't log into AWS workspace | Check SCIM provisioning logs in Azure AD; manual sync if needed |
| DNS resolution failure | Private endpoint connections time out | Verify cross-cloud DNS forwarders; check Route 53 resolver rules |
| VPN tunnel flapping | Intermittent connectivity | Check BGP timers; ensure DPD settings match on both sides |
| Delta Sharing latency | Cross-cloud queries slow | Verify VPN bandwidth; consider Deep Clone for high-frequency reads |
| Cluster startup failure | "Insufficient capacity" on AWS | Change to a different instance type or availability zone |
| Secret access errors | Jobs fail with 403 on AWS Secrets Manager | Verify IAM role trust policy includes Databricks service principal |
Multi-Cloud Lakehouse Blueprint — Datanest Digital — datanest.dev