Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Unity Catalog Migration Kit: Unity Catalog Migration Guide


Unity Catalog Migration Kit — Datanest Digital
https://datanest.dev


Table of Contents

  1. Overview
  2. Prerequisites
  3. Phase 1: Assessment
  4. Phase 2: Infrastructure Setup
  5. Phase 3: Migration Planning
  6. Phase 4: Table Migration
  7. Phase 5: Permission Mapping
  8. Phase 6: Validation
  9. Phase 7: Cutover
  10. Phase 8: Decommission
  11. Troubleshooting
  12. FAQ

Overview

This guide walks through migrating a Databricks workspace from legacy Hive Metastore to Unity Catalog using the tools provided in this kit. The process is designed to be:

  • Non-destructive — Source tables remain untouched throughout migration
  • Incremental — Migrate in waves, validating each wave before proceeding
  • Reversible — Rollback procedures exist for every phase (see rollback_procedures.md)

Migration Strategies

Strategy     Best For                                Speed    Risk
DEEP CLONE   Delta tables                            Fast     Low
CTAS         Non-Delta tables (Parquet, CSV, JSON)   Medium   Low
SYNC         Incremental re-sync of Delta tables     Fast     Low

The migration notebook automatically selects the best strategy per table.
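The selection rule can be pictured in plain Python. This is an illustrative sketch, not the notebook's actual code; the metadata field names (`format`, `already_migrated`) are assumptions:

```python
def pick_strategy(table: dict) -> str:
    """Choose a migration strategy for one table (illustrative only)."""
    if table.get("already_migrated"):
        return "SYNC"        # incremental re-sync of an already-cloned Delta table
    if table.get("format", "").lower() == "delta":
        return "DEEP CLONE"  # fast, transactionally consistent copy for Delta
    return "CTAS"            # CREATE TABLE AS SELECT for Parquet/CSV/JSON
```

For example, `pick_strategy({"format": "parquet"})` returns `"CTAS"`, while a Delta table that was cloned in an earlier wave falls through to `"SYNC"`.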


Prerequisites

Workspace Requirements

  • Databricks workspace on a Unity Catalog-supported plan (Premium or Enterprise)
  • Unity Catalog enabled at the account level
  • Databricks Runtime 13.3 LTS or later
  • Workspace admin privileges

Account-Level Requirements

  • Account admin access (for metastore creation)
  • Cloud storage bucket/container for metastore root
  • IAM role (AWS), managed identity (Azure), or service account (GCP) for storage access

Local Tooling

  • Terraform >= 1.5.0
  • Python >= 3.9
  • Databricks CLI (optional, for notebook import)

Pre-Flight Checklist

  • [ ] Account-level Unity Catalog is enabled
  • [ ] Cloud storage for metastore root is provisioned
  • [ ] Storage credential (IAM role / managed identity) is configured
  • [ ] Workspace admin access confirmed
  • [ ] Cluster with DBR 13.3+ is available
  • [ ] Backup of existing Hive Metastore metadata (recommended)

Phase 1: Assessment

Estimated time: 1 day

1.1 Import the Assessment Notebook

Import notebooks/pre_migration_assessment.py into your Databricks workspace.

1.2 Configure Parameters

Parameter                Description                                        Example
output_path              DBFS path for assessment output                    /tmp/unity_catalog_migration/assessment
include_data_profiling   Whether to count rows (slower but more accurate)   true
exclude_databases        Databases to skip                                  default,information_schema
assessment_tag           Label for this assessment run                      initial

1.3 Run the Assessment

Run all cells. The notebook will:

  1. Discover all databases in Hive Metastore
  2. Catalog every table, view, and function with metadata
  3. Scan existing permissions (SHOW GRANTS)
  4. Score migration complexity per table
  5. Save a JSON report and Delta table

1.4 Review Results

The assessment produces:

  • JSON report — Full metadata for all objects
  • Delta table — Queryable assessment data
  • Summary — Table counts, size, complexity distribution, format breakdown

Download the JSON report for use with the CLI tools in the next phase.


Phase 2: Infrastructure Setup

Estimated time: 2 days

2.1 Configure Terraform Variables

Copy the example below and customize for your environment:

# terraform.tfvars
cloud_provider         = "aws"
region                 = "us-east-1"
metastore_name         = "primary"
metastore_storage_root = "s3://your-bucket/unity-catalog-metastore"
metastore_owner        = "admin@yourorg.com"

workspace_ids = ["1234567890"]

storage_credentials = {
  "main-credential" = {
    comment  = "Primary storage credential"
    role_arn = "arn:aws:iam::123456789012:role/unity-catalog-role"
  }
}

external_locations = {
  "raw-data" = {
    url             = "s3://your-bucket/raw"
    credential_name = "main-credential"
    comment         = "Raw data landing zone"
  }
}

catalogs = {
  "production" = {
    comment = "Production catalog"
    grants = [
      {
        principal  = "data-engineers"
        privileges = ["USE CATALOG", "CREATE SCHEMA"]
      }
    ]
  }
  "staging" = {
    comment = "Staging catalog for migration validation"
  }
}

schemas = {
  "production.sales" = {
    catalog_name = "production"
    schema_name  = "sales"
    grants = [
      {
        principal  = "data-engineers"
        privileges = ["USE SCHEMA", "CREATE TABLE", "SELECT"]
      }
    ]
  }
}

2.2 Deploy

cd terraform/unity-catalog-setup
terraform init
terraform plan -var-file="terraform.tfvars"
terraform apply -var-file="terraform.tfvars"

2.3 Verify

Confirm in the Databricks UI:

  • Metastore appears under Data tab
  • Catalogs and schemas are visible
  • External locations are accessible (test with a simple read)

Phase 3: Migration Planning

Estimated time: 0.5 days

3.1 Generate Migration Waves

Use the migration planner CLI tool:

python tools/migration_planner.py \
  --assessment-file assessment_report.json \
  --max-tables-per-wave 50 \
  --max-size-gb-per-wave 500 \
  --output waves.json
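Conceptually, the planner packs tables into waves under both caps. A minimal greedy sketch of that idea (not the tool's actual algorithm):

```python
def plan_waves(tables, max_tables=50, max_size_gb=500):
    """First-fit-decreasing packing of (name, size_gb) pairs into waves."""
    waves = []
    for name, size in sorted(tables, key=lambda t: -t[1]):  # largest first
        for wave in waves:
            if (len(wave) < max_tables
                    and sum(s for _, s in wave) + size <= max_size_gb):
                wave.append((name, size))
                break
        else:
            waves.append([(name, size)])   # no existing wave fits: open a new one
    return waves
```

With the defaults above, three tables of 300, 300, and 100 GB land in two waves, since two 300 GB tables together would exceed the 500 GB cap.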

3.2 Estimate Timeline

python tools/timeline_estimator.py \
  --assessment-file assessment_report.json \
  --parallel-streams 4
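As a back-of-the-envelope check on the tool's output, wall-clock time is roughly total volume divided by aggregate throughput. The 100 GB/hour-per-stream figure below is an assumed placeholder, not a measured number:

```python
def estimate_hours(total_gb: float, streams: int = 4,
                   gb_per_hour_per_stream: float = 100.0) -> float:
    """Naive estimate: ignores per-table overhead and skewed table sizes."""
    return total_gb / (gb_per_hour_per_stream * streams)
```

For instance, 2000 GB across 4 parallel streams at the assumed rate works out to about 5 hours; real runs will deviate, which is why the estimator exists.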

3.3 Review and Adjust

Open waves.json and review. Consider:

  • Moving critical business tables to later waves (after validation patterns are proven)
  • Grouping tables with interdependencies into the same wave
  • Scheduling large tables during off-peak hours
  • Using the CSV template (templates/migration_planning_spreadsheet.csv) to track assignments

Phase 4: Table Migration

Estimated time: Varies (see timeline estimator)

4.1 Import the Migration Notebook

Import notebooks/automated_table_migration.py into your workspace.

4.2 Dry Run First

For each wave, run the notebook with dry_run = true:

Parameter         Value
target_catalog    Your target catalog name
source_database   Source Hive database for this wave
migration_mode    ctas (or clone for Delta-only)
dry_run           true

Review the output to confirm which tables will be migrated and their row counts.

4.3 Execute Migration

Set dry_run = false and run. The notebook:

  1. Creates the target schema if it doesn't exist
  2. Selects the best migration strategy per table (DEEP CLONE for Delta, CTAS for others)
  3. Migrates each table with progress logging
  4. Performs quick row count validation per table
  5. Handles views separately (rewrites DDL to reference UC)
  6. Saves a checkpoint log
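The checkpoint-and-resume behavior in step 6 can be illustrated with a minimal sketch. The file path and JSON format here are assumptions, not the notebook's actual log layout:

```python
import json
import os
import tempfile

# Hypothetical checkpoint file; the real notebook uses its own log location.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "migration_checkpoint.json")

def load_done() -> set:
    """Tables completed in earlier runs."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def migrate_wave(tables, migrate_fn):
    """Run migrate_fn for each table, skipping ones already checkpointed."""
    done = load_done()
    for t in tables:
        if t in done:
            continue                       # finished on a previous (interrupted) run
        migrate_fn(t)
        done.add(t)
        with open(CHECKPOINT, "w") as f:   # persist after every table
            json.dump(sorted(done), f)
```

Because the checkpoint is written after every table, an interrupted run re-executed with the same parameters picks up exactly where it stopped.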

4.4 Monitor Progress

Watch the notebook output for [OK], [ERR], or [SKIP] indicators. Failed tables are logged with error messages for troubleshooting.

4.5 Repeat for Each Wave

Repeat steps 4.2–4.4 for each migration wave. The notebook tracks its own checkpoints so it can resume if interrupted.


Phase 5: Permission Mapping

Estimated time: 1 day

5.1 Import the Permission Mapper

Import notebooks/permission_mapper.py into your workspace.

5.2 Configure Group Mapping (Optional)

If your identity provider groups have been renamed or restructured for Unity Catalog, provide a JSON mapping:

{"old_group_name": "new_group_name", "legacy_admins": "platform_admins"}
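Applying such a mapping is a dictionary lookup with a pass-through default, so that users and unchanged groups are left alone. A sketch:

```python
GROUP_MAPPING = {"old_group_name": "new_group_name",
                 "legacy_admins": "platform_admins"}

def map_principal(principal: str, mapping: dict = GROUP_MAPPING) -> str:
    """Unmapped principals (users, unchanged groups) pass through as-is."""
    return mapping.get(principal, principal)
```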

5.3 Dry Run

Run with dry_run = true to preview all GRANT statements that will be applied.

5.4 Apply Permissions

Set dry_run = false to execute the grants. Review the output report for any failures.


Phase 6: Validation

Estimated time: 1 day

6.1 Run Post-Migration Validation

Import and run notebooks/post_migration_validation.py:

Parameter          Value
source_database    Hive database name
target_catalog     UC catalog name
sample_size        1000 (adjust based on thoroughness needed)
fail_on_mismatch   false (set to true for CI pipelines)

6.2 Review Results

The validation checks three dimensions per table:

  • Row count — Exact match between source and target
  • Schema — Column names, types, and nullability
  • Data sample — Hash-based comparison of sampled rows
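The hash-based sample check can be pictured as digesting each sampled row and comparing digests order-independently. This is a pure-Python sketch of the idea, not the validation notebook's exact implementation:

```python
import hashlib

def row_digest(row) -> str:
    """Stable digest of one row; values are stringified in column order."""
    joined = "\x1f".join(str(v) for v in row)   # unit separator avoids "a,b"/"a","b" collisions
    return hashlib.sha256(joined.encode()).hexdigest()

def samples_match(source_rows, target_rows) -> bool:
    """Order-independent comparison of two row samples."""
    return sorted(map(row_digest, source_rows)) == sorted(map(row_digest, target_rows))
```

Sorting the digests makes the comparison insensitive to row order, which routinely differs between the source and the migrated copy.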

6.3 Address Failures

Common validation issues and resolutions:

Issue                Cause                             Resolution
Row count mismatch   Active writes during migration    Re-run migration for affected tables
Schema difference    Type promotion during CTAS        Verify type is compatible; adjust if needed
Data hash mismatch   Floating-point precision          Check if differences are within acceptable tolerance
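For the floating-point case, a tolerance-aware comparison such as the sketch below usually separates real data drift from representation noise:

```python
import math

def values_close(a, b, rel_tol: float = 1e-9) -> bool:
    """Exact equality for non-floats; relative tolerance for floats."""
    if isinstance(a, float) or isinstance(b, float):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b
```

Tune `rel_tol` to your data's precision requirements; `1e-9` is only a starting point.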

Phase 7: Cutover

Estimated time: 0.5 days

7.1 Update Application References

Update all notebooks, jobs, and pipelines to reference the Unity Catalog three-level namespace:

# Before (Hive Metastore)
spark.table("database.table")

# After (Unity Catalog)
spark.table("catalog.schema.table")

7.2 Update SQL Queries

-- Before
SELECT * FROM database.table

-- After
SELECT * FROM catalog.schema.table

7.3 Set Default Catalog

Configure the default catalog for each cluster or SQL warehouse:

USE CATALOG production;

Or set spark.databricks.sql.initial.catalog.name in cluster configuration.

7.4 Verify Job Execution

Run a representative set of jobs against Unity Catalog and confirm correct behavior.


Phase 8: Decommission

Estimated time: 2 days

8.1 Parallel Operation Period

Run both Hive Metastore and Unity Catalog in parallel for a defined period (recommended: 1–2 weeks) to catch any missed references.

8.2 Remove Hive Metastore References

After the parallel period, remove all remaining references to hive_metastore in notebooks, jobs, and configuration.

8.3 Archive or Drop Source Tables

Once confident, archive or drop the original Hive Metastore tables. Consider keeping a metadata backup before dropping.


Troubleshooting

Common Issues

"Cannot create table — schema does not exist"
Ensure the target schema was created. Check Terraform output or create manually:

CREATE SCHEMA IF NOT EXISTS catalog.schema;

"Access denied" during migration
The migration cluster user needs CREATE TABLE on the target schema and SELECT on the source tables. Verify with:

SHOW GRANTS ON SCHEMA catalog.schema;

"Table already exists"
Set overwrite_existing = true in the migration notebook, or manually drop the target table first.

External table migration fails
External tables require a Unity Catalog storage credential and an external location that covers the table's storage path.
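A quick pre-check is to test whether each external table's path is prefixed by a configured external location URL. A sketch with made-up paths:

```python
def covered_by_location(table_path: str, location_urls) -> bool:
    """True if table_path sits under one of the external location URLs."""
    normalized = table_path.rstrip("/") + "/"
    # Appending "/" to both sides prevents false prefix matches
    # (e.g. ".../raw" must not cover ".../rawdata").
    return any(normalized.startswith(url.rstrip("/") + "/") for url in location_urls)
```

Running this over the assessment report's external table paths before Phase 4 surfaces missing external locations while they are still cheap to fix in Terraform.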

View migration fails with reference error
Views that reference multiple databases may need manual DDL adjustment. Check the error message for the specific unresolved reference.


FAQ

Q: Can I migrate incrementally?
A: Yes. The migration notebook supports running in waves and tracks checkpoints. You can also use sync mode for Delta tables to perform incremental updates.

Q: Is the migration destructive?
A: No. Source Hive Metastore tables are never modified or deleted by any notebook in this kit.

Q: What about streaming tables and materialized views?
A: These must be recreated in Unity Catalog as they cannot be migrated via CTAS/CLONE. Recreate using the original DDL with UC references.

Q: Do I need to stop jobs during migration?
A: Not necessarily for the migration itself, but active writes to source tables during migration may cause row count mismatches. Schedule large table migrations during low-activity windows.

Q: How do I handle cross-database views?
A: Views that reference tables in multiple databases need their DDL manually updated to use the full UC three-level namespace for each referenced table.
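A first pass at that rewrite can be scripted with a regex over the view DDL. This sketch maps each Hive database to a target catalog and keeps the database name as the schema; it is deliberately simplistic (it would also match qualified column references like `t.col`), so treat it as a starting point for manual review, not a complete solution:

```python
import re

def rewrite_two_level_refs(ddl: str, db_to_catalog: dict) -> str:
    """Rewrite db.table as catalog.db.table for databases in the mapping."""
    def repl(m):
        db, obj = m.group(1), m.group(2)
        catalog = db_to_catalog.get(db)
        return f"{catalog}.{db}.{obj}" if catalog else m.group(0)
    return re.sub(r"\b(\w+)\.(\w+)\b", repl, ddl)
```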




This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete Unity Catalog Migration Kit with all files, templates, and documentation for $69.

Get the Full Kit →

Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.

Get the Complete Bundle →

