Unity Catalog Migration Guide
Unity Catalog Migration Kit — Datanest Digital
https://datanest.dev
Table of Contents
- Overview
- Prerequisites
- Phase 1: Assessment
- Phase 2: Infrastructure Setup
- Phase 3: Migration Planning
- Phase 4: Table Migration
- Phase 5: Permission Mapping
- Phase 6: Validation
- Phase 7: Cutover
- Phase 8: Decommission
- Troubleshooting
- FAQ
Overview
This guide walks through migrating a Databricks workspace from legacy Hive Metastore to Unity Catalog using the tools provided in this kit. The process is designed to be:
- Non-destructive — Source tables remain untouched throughout migration
- Incremental — Migrate in waves, validating each wave before proceeding
- Reversible — Rollback procedures exist for every phase (see rollback_procedures.md)
Migration Strategies
| Strategy | Best For | Speed | Risk |
|---|---|---|---|
| DEEP CLONE | Delta tables | Fast | Low |
| CTAS | Non-Delta tables (Parquet, CSV, JSON) | Medium | Low |
| SYNC | Incremental re-sync of Delta tables | Fast | Low |
The migration notebook automatically selects the best strategy per table.
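The selection rule in the table above can be sketched in a few lines. This is illustrative only: `choose_strategy` is a hypothetical helper, not the notebook's actual internals.

```python
# Sketch of per-table strategy selection, mirroring the strategy table.
# Hypothetical helper; the real notebook's logic may weigh more factors.
def choose_strategy(table_format: str, already_migrated: bool) -> str:
    """Pick a migration strategy based on source table format."""
    if table_format.lower() == "delta":
        # SYNC re-syncs an already-migrated Delta table incrementally;
        # DEEP CLONE copies data and metadata on the first pass.
        return "SYNC" if already_migrated else "DEEP CLONE"
    # Parquet, CSV, JSON, etc. are rewritten as managed tables via CTAS.
    return "CTAS"

print(choose_strategy("delta", False))    # DEEP CLONE
print(choose_strategy("parquet", False))  # CTAS
```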
Prerequisites
Workspace Requirements
- Databricks workspace on a Unity Catalog-supported plan (Premium or Enterprise)
- Unity Catalog enabled at the account level
- Databricks Runtime 13.3 LTS or later
- Workspace admin privileges
Account-Level Requirements
- Account admin access (for metastore creation)
- Cloud storage bucket/container for metastore root
- IAM role (AWS), managed identity (Azure), or service account (GCP) for storage access
Local Tooling
- Terraform >= 1.5.0
- Python >= 3.9
- Databricks CLI (optional, for notebook import)
Pre-Flight Checklist
- [ ] Account-level Unity Catalog is enabled
- [ ] Cloud storage for metastore root is provisioned
- [ ] Storage credential (IAM role / managed identity) is configured
- [ ] Workspace admin access confirmed
- [ ] Cluster with DBR 13.3+ is available
- [ ] Backup of existing Hive Metastore metadata (recommended)
Phase 1: Assessment
Estimated time: 1 day
1.1 Import the Assessment Notebook
Import notebooks/pre_migration_assessment.py into your Databricks workspace.
1.2 Configure Parameters
| Parameter | Description | Example |
|---|---|---|
| output_path | DBFS path for assessment output | /tmp/unity_catalog_migration/assessment |
| include_data_profiling | Whether to count rows (slower but more accurate) | true |
| exclude_databases | Databases to skip | default,information_schema |
| assessment_tag | Label for this assessment run | initial |
1.3 Run the Assessment
Run all cells. The notebook will:
- Discover all databases in Hive Metastore
- Catalog every table, view, and function with metadata
- Scan existing permissions (SHOW GRANTS)
- Score migration complexity per table
- Save a JSON report and Delta table
1.4 Review Results
The assessment produces:
- JSON report — Full metadata for all objects
- Delta table — Queryable assessment data
- Summary — Table counts, size, complexity distribution, format breakdown
Download the JSON report for use with the CLI tools in the next phase.
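Once downloaded, the report can be inspected locally before feeding it to the planner. A minimal sketch, assuming a top-level `tables` array with a `size_bytes` field; check the actual structure the notebook produces, as these field names are assumptions:

```python
import json

def summarize(report_path: str) -> dict:
    """Quick local summary of the assessment report.

    Assumes a top-level "tables" list whose entries may carry a
    "size_bytes" field; adjust to the report's real schema.
    """
    with open(report_path) as f:
        report = json.load(f)
    tables = report.get("tables", [])
    return {
        "table_count": len(tables),
        "total_size_gb": sum(t.get("size_bytes", 0) for t in tables) / 1e9,
    }
```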
Phase 2: Infrastructure Setup
Estimated time: 2 days
2.1 Configure Terraform Variables
Copy the example below and customize for your environment:
```hcl
# terraform.tfvars
cloud_provider         = "aws"
region                 = "us-east-1"
metastore_name         = "primary"
metastore_storage_root = "s3://your-bucket/unity-catalog-metastore"
metastore_owner        = "admin@yourorg.com"
workspace_ids          = ["1234567890"]

storage_credentials = {
  "main-credential" = {
    comment  = "Primary storage credential"
    role_arn = "arn:aws:iam::123456789012:role/unity-catalog-role"
  }
}

external_locations = {
  "raw-data" = {
    url             = "s3://your-bucket/raw"
    credential_name = "main-credential"
    comment         = "Raw data landing zone"
  }
}

catalogs = {
  "production" = {
    comment = "Production catalog"
    grants = [
      {
        principal  = "data-engineers"
        privileges = ["USE CATALOG", "CREATE SCHEMA"]
      }
    ]
  }
  "staging" = {
    comment = "Staging catalog for migration validation"
  }
}

schemas = {
  "production.sales" = {
    catalog_name = "production"
    schema_name  = "sales"
    grants = [
      {
        principal  = "data-engineers"
        privileges = ["USE SCHEMA", "CREATE TABLE", "SELECT"]
      }
    ]
  }
}
```
2.2 Deploy
```shell
cd terraform/unity-catalog-setup
terraform init
terraform plan -var-file="terraform.tfvars"
terraform apply -var-file="terraform.tfvars"
```
2.3 Verify
Confirm in the Databricks UI:
- Metastore appears under Data tab
- Catalogs and schemas are visible
- External locations are accessible (test with a simple read)
Phase 3: Migration Planning
Estimated time: 0.5 days
3.1 Generate Migration Waves
Use the migration planner CLI tool:
```shell
python tools/migration_planner.py \
  --assessment-file assessment_report.json \
  --max-tables-per-wave 50 \
  --max-size-gb-per-wave 500 \
  --output waves.json
```
3.2 Estimate Timeline
```shell
python tools/timeline_estimator.py \
  --assessment-file assessment_report.json \
  --parallel-streams 4
```
3.3 Review and Adjust
Open waves.json and review. Consider:
- Moving critical business tables to later waves (after validation patterns are proven)
- Grouping tables with interdependencies into the same wave
- Scheduling large tables during off-peak hours
- Using the CSV template (templates/migration_planning_spreadsheet.csv) to track assignments
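The planner's core constraint logic can be approximated with a greedy packing pass, using the same limits the CLI takes (`--max-tables-per-wave`, `--max-size-gb-per-wave`). This is a sketch only; the real planner may also account for dependencies and complexity scores.

```python
# Greedy wave packing: fill a wave until either the table-count or the
# size budget is hit, then start a new wave. Smallest tables first so
# early waves finish (and validate) quickly. Illustrative only.
def plan_waves(tables, max_tables=50, max_size_gb=500):
    """tables: list of (name, size_gb) pairs; returns a list of waves."""
    waves, current, current_size = [], [], 0.0
    for name, size_gb in sorted(tables, key=lambda t: t[1]):
        if current and (len(current) >= max_tables
                        or current_size + size_gb > max_size_gb):
            waves.append(current)
            current, current_size = [], 0.0
        # A single table larger than the budget still gets its own wave.
        current.append(name)
        current_size += size_gb
    if current:
        waves.append(current)
    return waves
```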
Phase 4: Table Migration
Estimated time: Varies (see timeline estimator)
4.1 Import the Migration Notebook
Import notebooks/automated_table_migration.py into your workspace.
4.2 Dry Run First
For each wave, run the notebook with dry_run = true:
| Parameter | Value |
|---|---|
| target_catalog | Your target catalog name |
| source_database | Source Hive database for this wave |
| migration_mode | ctas (or clone for Delta-only) |
| dry_run | true |
Review the output to confirm which tables will be migrated and their row counts.
4.3 Execute Migration
Set dry_run = false and run. The notebook:
- Creates the target schema if it doesn't exist
- Selects the best migration strategy per table (DEEP CLONE for Delta, CTAS for others)
- Migrates each table with progress logging
- Performs quick row count validation per table
- Handles views separately (rewrites DDL to reference UC)
- Saves a checkpoint log
4.4 Monitor Progress
Watch the notebook output for [OK], [ERR], or [SKIP] indicators. Failed tables are logged with error messages for troubleshooting.
4.5 Repeat for Each Wave
Repeat steps 4.2–4.4 for each migration wave. The notebook tracks its own checkpoints so it can resume if interrupted.
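The resume-after-interruption behavior follows a standard checkpoint pattern. A minimal stand-alone sketch, assuming a JSON checkpoint file of completed table names; the notebook's actual checkpoint format may differ.

```python
import json
import os

def migrate_with_checkpoint(tables, migrate_fn, checkpoint_path):
    """Run migrate_fn per table, skipping tables already checkpointed.

    The checkpoint file (a JSON list of completed table names) is an
    assumption for illustration, not the notebook's real format.
    """
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))
    for table in tables:
        if table in done:
            continue  # completed in a previous run; skip on resume
        migrate_fn(table)
        done.add(table)
        # Persist after every table so an interruption loses at most
        # the in-flight table.
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)
```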
Phase 5: Permission Mapping
Estimated time: 1 day
5.1 Import the Permission Mapper
Import notebooks/permission_mapper.py into your workspace.
5.2 Configure Group Mapping (Optional)
If your identity provider groups have been renamed or restructured for Unity Catalog, provide a JSON mapping:
```json
{"old_group_name": "new_group_name", "legacy_admins": "platform_admins"}
```
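Conceptually, the mapper substitutes the new group name wherever the old one appears as a grant principal, falling back to the original name when no entry exists. A sketch of that lookup; `grant_sql` is a hypothetical helper, not the notebook's API:

```python
# Applying a group mapping while generating GRANT statements.
# Unmapped principals pass through unchanged.
mapping = {"old_group_name": "new_group_name",
           "legacy_admins": "platform_admins"}

def map_principal(principal: str) -> str:
    return mapping.get(principal, principal)

def grant_sql(privilege: str, securable: str, principal: str) -> str:
    """Hypothetical helper: render one UC GRANT statement."""
    return (f"GRANT {privilege} ON {securable} "
            f"TO `{map_principal(principal)}`")

print(grant_sql("SELECT", "TABLE production.sales.orders", "legacy_admins"))
# GRANT SELECT ON TABLE production.sales.orders TO `platform_admins`
```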
5.3 Dry Run
Run with dry_run = true to preview all GRANT statements that will be applied.
5.4 Apply Permissions
Set dry_run = false to execute the grants. Review the output report for any failures.
Phase 6: Validation
Estimated time: 1 day
6.1 Run Post-Migration Validation
Import and run notebooks/post_migration_validation.py:
| Parameter | Value |
|---|---|
| source_database | Hive database name |
| target_catalog | UC catalog name |
| sample_size | 1000 (adjust based on thoroughness needed) |
| fail_on_mismatch | false (set to true for CI pipelines) |
6.2 Review Results
The validation checks three dimensions per table:
- Row count — Exact match between source and target
- Schema — Column names, types, and nullability
- Data sample — Hash-based comparison of sampled rows
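The idea behind the data-sample check can be shown in plain Python: hash each sampled row and compare the hash sets order-insensitively. The validation notebook runs the equivalent in Spark; this stand-alone sketch only illustrates the principle.

```python
import hashlib

def row_hash(row) -> str:
    """Stable hash of one row. Real pipelines must normalise floats
    and NULLs first, since float formatting differences alone can
    produce spurious mismatches."""
    return hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()

def samples_match(source_rows, target_rows) -> bool:
    """Order-insensitive comparison of two row samples."""
    return (sorted(row_hash(r) for r in source_rows)
            == sorted(row_hash(r) for r in target_rows))
```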
6.3 Address Failures
Common validation issues and resolutions:
| Issue | Cause | Resolution |
|---|---|---|
| Row count mismatch | Active writes during migration | Re-run migration for affected tables |
| Schema difference | Type promotion during CTAS | Verify type is compatible; adjust if needed |
| Data hash mismatch | Floating-point precision | Check if differences are within acceptable tolerance |
Phase 7: Cutover
Estimated time: 0.5 days
7.1 Update Application References
Update all notebooks, jobs, and pipelines to reference Unity Catalog three-level namespace:
```python
# Before (Hive Metastore)
spark.table("database.table")

# After (Unity Catalog)
spark.table("catalog.schema.table")
```
7.2 Update SQL Queries
```sql
-- Before
SELECT * FROM database.table

-- After
SELECT * FROM catalog.schema.table
```
7.3 Set Default Catalog
Configure the default catalog for each cluster or SQL warehouse:
```sql
USE CATALOG production;
```
Or set spark.databricks.sql.initial.catalog.name in cluster configuration.
7.4 Verify Job Execution
Run a representative set of jobs against Unity Catalog and confirm correct behavior.
Phase 8: Decommission
Estimated time: 2 days
8.1 Parallel Operation Period
Run both Hive Metastore and Unity Catalog in parallel for a defined period (recommended: 1–2 weeks) to catch any missed references.
8.2 Remove Hive Metastore References
After the parallel period, remove all remaining references to hive_metastore in notebooks, jobs, and configuration.
8.3 Archive or Drop Source Tables
Once confident, archive or drop the original Hive Metastore tables. Consider keeping a metadata backup before dropping.
Troubleshooting
Common Issues
"Cannot create table — schema does not exist"
Ensure the target schema was created. Check Terraform output or create manually:
```sql
CREATE SCHEMA IF NOT EXISTS catalog.schema;
```
"Access denied" during migration
The migration cluster user needs CREATE TABLE on the target schema and SELECT on the source tables. Verify with:
```sql
SHOW GRANTS ON SCHEMA catalog.schema;
```
"Table already exists"
Set overwrite_existing = true in the migration notebook, or manually drop the target table first.
External table migration fails
External tables require a storage credential and external location configured in Unity Catalog that covers the table's storage path.
View migration fails with reference error
Views that reference multiple databases may need manual DDL adjustment. Check the error message for the specific unresolved reference.
FAQ
Q: Can I migrate incrementally?
A: Yes. The migration notebook supports running in waves and tracks checkpoints. You can also use sync mode for Delta tables to perform incremental updates.
Q: Is the migration destructive?
A: No. Source Hive Metastore tables are never modified or deleted by any notebook in this kit.
Q: What about streaming tables and materialized views?
A: These must be recreated in Unity Catalog as they cannot be migrated via CTAS/CLONE. Recreate using the original DDL with UC references.
Q: Do I need to stop jobs during migration?
A: Not necessarily for the migration itself, but active writes to source tables during migration may cause row count mismatches. Schedule large table migrations during low-activity windows.
Q: How do I handle cross-database views?
A: Views that reference tables in multiple databases need their DDL manually updated to use the full UC three-level namespace for each referenced table.
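A mechanical first pass at that DDL update can be scripted, though regex rewrites are fragile (string literals, comments, backticked identifiers) and the output should always be reviewed by hand. `rewrite_view_ddl` is a hypothetical helper, not part of this kit:

```python
import re

def rewrite_view_ddl(ddl: str, databases: set, catalog: str) -> str:
    """Naively rewrite two-level references (db.table) to three-level
    UC references (catalog.db.table) for the given source databases.
    Review the output manually before applying it."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, sorted(databases))) + r")\.(\w+)")
    return pattern.sub(rf"{catalog}.\1.\2", ddl)

ddl = "SELECT * FROM sales.orders o JOIN finance.ledger l ON o.id = l.id"
print(rewrite_view_ddl(ddl, {"sales", "finance"}, "production"))
```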