Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Unity Catalog Migration Kit: Unity Catalog Migration Guide


Unity Catalog Migration Kit — Datanest Digital
https://datanest.dev


Table of Contents

  1. Overview
  2. Prerequisites
  3. Phase 1: Assessment
  4. Phase 2: Infrastructure Setup
  5. Phase 3: Migration Planning
  6. Phase 4: Table Migration
  7. Phase 5: Permission Mapping
  8. Phase 6: Validation
  9. Phase 7: Cutover
  10. Phase 8: Decommission
  11. Troubleshooting
  12. FAQ

Overview

This guide walks through migrating a Databricks workspace from legacy Hive Metastore to Unity Catalog using the tools provided in this kit. The process is designed to be:

  • Non-destructive — Source tables remain untouched throughout migration
  • Incremental — Migrate in waves, validating each wave before proceeding
  • Reversible — Rollback procedures exist for every phase (see rollback_procedures.md)

Migration Strategies

Strategy     Best For                                Speed    Risk
DEEP CLONE   Delta tables                            Fast     Low
CTAS         Non-Delta tables (Parquet, CSV, JSON)   Medium   Low
SYNC         Incremental re-sync of Delta tables     Fast     Low

The migration notebook automatically selects the best strategy per table.
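The selection rule can be pictured in plain Python. This is an illustrative sketch, not the notebook's actual code; the metadata field names (`format`, `already_migrated`) are assumptions:

```python
def pick_strategy(table: dict) -> str:
    """Choose a migration strategy for one table (illustrative only)."""
    if table.get("already_migrated"):
        return "SYNC"        # incremental re-sync of an already-cloned Delta table
    if table.get("format", "").lower() == "delta":
        return "DEEP CLONE"  # fast, transactionally consistent copy for Delta
    return "CTAS"            # CREATE TABLE AS SELECT for Parquet/CSV/JSON
```

For example, `pick_strategy({"format": "parquet"})` returns `"CTAS"`, while a Delta table that was cloned in an earlier wave falls through to `"SYNC"`.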


Prerequisites

Workspace Requirements

  • Databricks workspace on a Unity Catalog-supported plan (Premium or Enterprise)
  • Unity Catalog enabled at the account level
  • Databricks Runtime 13.3 LTS or later
  • Workspace admin privileges

Account-Level Requirements

  • Account admin access (for metastore creation)
  • Cloud storage bucket/container for metastore root
  • IAM role (AWS), managed identity (Azure), or service account (GCP) for storage access

Local Tooling

  • Terraform >= 1.5.0
  • Python >= 3.9
  • Databricks CLI (optional, for notebook import)

Pre-Flight Checklist

  • [ ] Account-level Unity Catalog is enabled
  • [ ] Cloud storage for metastore root is provisioned
  • [ ] Storage credential (IAM role / managed identity) is configured
  • [ ] Workspace admin access confirmed
  • [ ] Cluster with DBR 13.3+ is available
  • [ ] Backup of existing Hive Metastore metadata (recommended)

Phase 1: Assessment

Estimated time: 1 day

1.1 Import the Assessment Notebook

Import notebooks/pre_migration_assessment.py into your Databricks workspace.

1.2 Configure Parameters

Parameter                Description                                        Example
output_path              DBFS path for assessment output                    /tmp/unity_catalog_migration/assessment
include_data_profiling   Whether to count rows (slower but more accurate)   true
exclude_databases        Databases to skip                                  default,information_schema
assessment_tag           Label for this assessment run                      initial

1.3 Run the Assessment

Run all cells. The notebook will:

  1. Discover all databases in Hive Metastore
  2. Catalog every table, view, and function with metadata
  3. Scan existing permissions (SHOW GRANTS)
  4. Score migration complexity per table
  5. Save a JSON report and Delta table

1.4 Review Results

The assessment produces:

  • JSON report — Full metadata for all objects
  • Delta table — Queryable assessment data
  • Summary — Table counts, size, complexity distribution, format breakdown

Download the JSON report for use with the CLI tools in the next phase.


Phase 2: Infrastructure Setup

Estimated time: 2 days

2.1 Configure Terraform Variables

Copy the example below and customize for your environment:

# terraform.tfvars
cloud_provider         = "aws"
region                 = "us-east-1"
metastore_name         = "primary"
metastore_storage_root = "s3://your-bucket/unity-catalog-metastore"
metastore_owner        = "admin@yourorg.com"

workspace_ids = ["1234567890"]

storage_credentials = {
  "main-credential" = {
    comment  = "Primary storage credential"
    role_arn = "arn:aws:iam::123456789012:role/unity-catalog-role"
  }
}

external_locations = {
  "raw-data" = {
    url             = "s3://your-bucket/raw"
    credential_name = "main-credential"
    comment         = "Raw data landing zone"
  }
}

catalogs = {
  "production" = {
    comment = "Production catalog"
    grants = [
      {
        principal  = "data-engineers"
        privileges = ["USE CATALOG", "CREATE SCHEMA"]
      }
    ]
  }
  "staging" = {
    comment = "Staging catalog for migration validation"
  }
}

schemas = {
  "production.sales" = {
    catalog_name = "production"
    schema_name  = "sales"
    grants = [
      {
        principal  = "data-engineers"
        privileges = ["USE SCHEMA", "CREATE TABLE", "SELECT"]
      }
    ]
  }
}

2.2 Deploy

cd terraform/unity-catalog-setup
terraform init
terraform plan -var-file="terraform.tfvars"
terraform apply -var-file="terraform.tfvars"

2.3 Verify

Confirm in the Databricks UI:

  • Metastore appears under Data tab
  • Catalogs and schemas are visible
  • External locations are accessible (test with a simple read)

Phase 3: Migration Planning

Estimated time: 0.5 days

3.1 Generate Migration Waves

Use the migration planner CLI tool:

python tools/migration_planner.py \
  --assessment-file assessment_report.json \
  --max-tables-per-wave 50 \
  --max-size-gb-per-wave 500 \
  --output waves.json
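Conceptually, the planner packs tables into waves under both caps. A minimal greedy sketch of that idea (not the tool's actual algorithm):

```python
def plan_waves(tables, max_tables=50, max_size_gb=500):
    """First-fit-decreasing packing of (name, size_gb) pairs into waves."""
    waves = []
    for name, size in sorted(tables, key=lambda t: -t[1]):  # largest first
        for wave in waves:
            if (len(wave) < max_tables
                    and sum(s for _, s in wave) + size <= max_size_gb):
                wave.append((name, size))
                break
        else:
            waves.append([(name, size)])   # no existing wave fits: open a new one
    return waves
```

With the defaults above, three tables of 300, 300, and 100 GB land in two waves, since two 300 GB tables together would exceed the 500 GB cap.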

3.2 Estimate Timeline

python tools/timeline_estimator.py \
  --assessment-file assessment_report.json \
  --parallel-streams 4
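As a back-of-the-envelope check on the tool's output, wall-clock time is roughly total volume divided by aggregate throughput. The 100 GB/hour-per-stream figure below is an assumed placeholder, not a measured number:

```python
def estimate_hours(total_gb: float, streams: int = 4,
                   gb_per_hour_per_stream: float = 100.0) -> float:
    """Naive estimate: ignores per-table overhead and skewed table sizes."""
    return total_gb / (gb_per_hour_per_stream * streams)
```

For instance, 2000 GB across 4 parallel streams at the assumed rate works out to about 5 hours; real runs will deviate, which is why the estimator exists.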

3.3 Review and Adjust

Open waves.json and review. Consider:

  • Moving critical business tables to later waves (after validation patterns are proven)
  • Grouping tables with interdependencies into the same wave
  • Scheduling large tables during off-peak hours
  • Using the CSV template (templates/migration_planning_spreadsheet.csv) to track assignments

Phase 4: Table Migration

Estimated time: Varies (see timeline estimator)

4.1 Import the Migration Notebook

Import notebooks/automated_table_migration.py into your workspace.

4.2 Dry Run First

For each wave, run the notebook with dry_run = true:

Parameter         Value
target_catalog    Your target catalog name
source_database   Source Hive database for this wave
migration_mode    ctas (or clone for Delta-only)
dry_run           true

Review the output to confirm which tables will be migrated and their row counts.

4.3 Execute Migration

Set dry_run = false and run. The notebook:

  1. Creates the target schema if it doesn't exist
  2. Selects the best migration strategy per table (DEEP CLONE for Delta, CTAS for others)
  3. Migrates each table with progress logging
  4. Performs quick row count validation per table
  5. Handles views separately (rewrites DDL to reference UC)
  6. Saves a checkpoint log
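The checkpoint-and-resume behavior in step 6 can be illustrated with a minimal sketch. The file path and JSON format here are assumptions, not the notebook's actual log layout:

```python
import json
import os
import tempfile

# Hypothetical checkpoint file; the real notebook uses its own log location.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "migration_checkpoint.json")

def load_done() -> set:
    """Tables completed in earlier runs."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def migrate_wave(tables, migrate_fn):
    """Run migrate_fn for each table, skipping ones already checkpointed."""
    done = load_done()
    for t in tables:
        if t in done:
            continue                       # finished on a previous (interrupted) run
        migrate_fn(t)
        done.add(t)
        with open(CHECKPOINT, "w") as f:   # persist after every table
            json.dump(sorted(done), f)
```

Because the checkpoint is written after every table, an interrupted run re-executed with the same parameters picks up exactly where it stopped.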

4.4 Monitor Progress

Watch the notebook output for [OK], [ERR], or [SKIP] indicators. Failed tables are logged with error messages for troubleshooting.

4.5 Repeat for Each Wave

Repeat steps 4.2–4.4 for each migration wave. The notebook tracks its own checkpoints so it can resume if interrupted.


Phase 5: Permission Mapping

Estimated time: 1 day

5.1 Import the Permission Mapper

Import notebooks/permission_mapper.py into your workspace.

5.2 Configure Group Mapping (Optional)

If your identity provider groups have been renamed or restructured for Unity Catalog, provide a JSON mapping:

{"old_group_name": "new_group_name", "legacy_admins": "platform_admins"}
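Applying such a mapping is a dictionary lookup with a pass-through default, so that users and unchanged groups are left alone. A sketch:

```python
GROUP_MAPPING = {"old_group_name": "new_group_name",
                 "legacy_admins": "platform_admins"}

def map_principal(principal: str, mapping: dict = GROUP_MAPPING) -> str:
    """Unmapped principals (users, unchanged groups) pass through as-is."""
    return mapping.get(principal, principal)
```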

5.3 Dry Run

Run with dry_run = true to preview all GRANT statements that will be applied.

5.4 Apply Permissions

Set dry_run = false to execute the grants. Review the output report for any failures.


Phase 6: Validation

Estimated time: 1 day

6.1 Run Post-Migration Validation

Import and run notebooks/post_migration_validation.py:

Parameter          Value
source_database    Hive database name
target_catalog     UC catalog name
sample_size        1000 (adjust based on thoroughness needed)
fail_on_mismatch   false (set to true for CI pipelines)

6.2 Review Results

The validation checks three dimensions per table:

  • Row count — Exact match between source and target
  • Schema — Column names, types, and nullability
  • Data sample — Hash-based comparison of sampled rows
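The hash-based sample check can be pictured as digesting each sampled row and comparing digests order-independently. This is a pure-Python sketch of the idea, not the validation notebook's exact implementation:

```python
import hashlib

def row_digest(row) -> str:
    """Stable digest of one row; values are stringified in column order."""
    joined = "\x1f".join(str(v) for v in row)   # unit separator avoids "a,b"/"a","b" collisions
    return hashlib.sha256(joined.encode()).hexdigest()

def samples_match(source_rows, target_rows) -> bool:
    """Order-independent comparison of two row samples."""
    return sorted(map(row_digest, source_rows)) == sorted(map(row_digest, target_rows))
```

Sorting the digests makes the comparison insensitive to row order, which routinely differs between the source and the migrated copy.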

6.3 Address Failures

Common validation issues and resolutions:

Issue                Cause                             Resolution
Row count mismatch   Active writes during migration    Re-run migration for affected tables
Schema difference    Type promotion during CTAS        Verify type is compatible; adjust if needed
Data hash mismatch   Floating-point precision          Check if differences are within acceptable tolerance
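For the floating-point case, a tolerance-aware comparison such as the sketch below usually separates real data drift from representation noise:

```python
import math

def values_close(a, b, rel_tol: float = 1e-9) -> bool:
    """Exact equality for non-floats; relative tolerance for floats."""
    if isinstance(a, float) or isinstance(b, float):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b
```

Tune `rel_tol` to your data's precision requirements; `1e-9` is only a starting point.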

Phase 7: Cutover

Estimated time: 0.5 days

7.1 Update Application References

Update all notebooks, jobs, and pipelines to reference the Unity Catalog three-level namespace:

# Before (Hive Metastore)
spark.table("database.table")

# After (Unity Catalog)
spark.table("catalog.schema.table")

7.2 Update SQL Queries

-- Before
SELECT * FROM database.table

-- After
SELECT * FROM catalog.schema.table

7.3 Set Default Catalog

Configure the default catalog for each cluster or SQL warehouse:

USE CATALOG production;

Or set spark.databricks.sql.initial.catalog.name in cluster configuration.

7.4 Verify Job Execution

Run a representative set of jobs against Unity Catalog and confirm correct behavior.


Phase 8: Decommission

Estimated time: 2 days

8.1 Parallel Operation Period

Run both Hive Metastore and Unity Catalog in parallel for a defined period (recommended: 1–2 weeks) to catch any missed references.

8.2 Remove Hive Metastore References

After the parallel period, remove all remaining references to hive_metastore in notebooks, jobs, and configuration.

8.3 Archive or Drop Source Tables

Once confident, archive or drop the original Hive Metastore tables. Consider keeping a metadata backup before dropping.


Troubleshooting

Common Issues

"Cannot create table — schema does not exist"
Ensure the target schema was created. Check Terraform output or create manually:

CREATE SCHEMA IF NOT EXISTS catalog.schema;

"Access denied" during migration
The migration cluster user needs CREATE TABLE on the target schema and SELECT on the source tables. Verify with:

SHOW GRANTS ON SCHEMA catalog.schema;

"Table already exists"
Set overwrite_existing = true in the migration notebook, or manually drop the target table first.

External table migration fails
External tables require a Unity Catalog storage credential and an external location that covers the table's storage path.
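A quick pre-check is to test whether each external table's path is prefixed by a configured external location URL. A sketch with made-up paths:

```python
def covered_by_location(table_path: str, location_urls) -> bool:
    """True if table_path sits under one of the external location URLs."""
    normalized = table_path.rstrip("/") + "/"
    # Appending "/" to both sides prevents false prefix matches
    # (e.g. ".../raw" must not cover ".../rawdata").
    return any(normalized.startswith(url.rstrip("/") + "/") for url in location_urls)
```

Running this over the assessment report's external table paths before Phase 4 surfaces missing external locations while they are still cheap to fix in Terraform.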

View migration fails with reference error
Views that reference multiple databases may need manual DDL adjustment. Check the error message for the specific unresolved reference.


FAQ

Q: Can I migrate incrementally?
A: Yes. The migration notebook supports running in waves and tracks checkpoints. You can also use sync mode for Delta tables to perform incremental updates.

Q: Is the migration destructive?
A: No. Source Hive Metastore tables are never modified or deleted by any notebook in this kit.

Q: What about streaming tables and materialized views?
A: These must be recreated in Unity Catalog as they cannot be migrated via CTAS/CLONE. Recreate using the original DDL with UC references.

Q: Do I need to stop jobs during migration?
A: Not necessarily for the migration itself, but active writes to source tables during migration may cause row count mismatches. Schedule large table migrations during low-activity windows.

Q: How do I handle cross-database views?
A: Views that reference tables in multiple databases need their DDL manually updated to use the full UC three-level namespace for each referenced table.
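A first pass at that rewrite can be scripted with a regex over the view DDL. This sketch maps each Hive database to a target catalog and keeps the database name as the schema; it is deliberately simplistic (it would also match qualified column references like `t.col`), so treat it as a starting point for manual review, not a complete solution:

```python
import re

def rewrite_two_level_refs(ddl: str, db_to_catalog: dict) -> str:
    """Rewrite db.table as catalog.db.table for databases in the mapping."""
    def repl(m):
        db, obj = m.group(1), m.group(2)
        catalog = db_to_catalog.get(db)
        return f"{catalog}.{db}.{obj}" if catalog else m.group(0)
    return re.sub(r"\b(\w+)\.(\w+)\b", repl, ddl)
```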




This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete Unity Catalog Migration Kit with all files, templates, and documentation for $69.

Get the Full Kit →

Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.

Get the Complete Bundle →

