Data Catalog Builder — Guide
Automated metadata discovery, classification, quality scoring, and documentation for Databricks Unity Catalog.
Table of Contents
- Prerequisites
- Scanner Configuration
- Running the Catalog Scan
- Metadata Enrichment
- Column Classification
- Quality Scoring
- Report Generation
- Search Index
- Dashboard Usage
- Troubleshooting
Prerequisites
Before using Data Catalog Builder, ensure:
- Databricks Runtime 13.3+ with Unity Catalog enabled
- Cluster access to the catalogs you want to scan
- Python packages: `pyyaml`, `jinja2` (both included in DBR by default)
- Permissions: `USE CATALOG`, `USE SCHEMA`, and `SELECT` on `information_schema` / target tables
- For lineage features: access to `system.access.table_lineage` and `system.access.column_lineage`
Recommended Cluster Config
| Setting | Value |
|---|---|
| Runtime | 13.3 LTS or 14.x |
| Node type | Standard_DS3_v2 (or equivalent) |
| Workers | 1–2 (metadata-only workload) |
| Unity Catalog | Enabled |
Scanner Configuration
The scanner is driven by `configs/catalog_config.yaml`. Key settings:
Catalogs and Schemas
```yaml
scanner:
  catalogs:
    - "main"
  schemas:
    - "bronze"
    - "silver"
    - "gold"
```
Leave `schemas` empty to scan all schemas in each catalog. This is useful for initial discovery but may be slow on catalogs with hundreds of schemas.
Exclusion Patterns
Filter out temporary and internal tables using glob-style patterns:
```yaml
scanner:
  exclude_patterns:
    - "_tmp_*"
    - "_staging_*"
    - "__apply_changes*"
    - "_checkpoint_*"
```
Patterns are matched against the table name only (not the full `catalog.schema.table` path). The `*` wildcard matches any sequence of characters.
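The matching behaves like shell-style globbing, which can be sketched with Python's standard `fnmatch` module (an illustrative sketch, not the scanner's actual implementation):

```python
from fnmatch import fnmatch

# Exclusion patterns as configured above; matched against the bare table
# name, not the full catalog.schema.table path.
exclude_patterns = ["_tmp_*", "_staging_*", "__apply_changes*", "_checkpoint_*"]

def is_excluded(table_name: str) -> bool:
    """Return True if the table name matches any exclusion glob."""
    return any(fnmatch(table_name, p) for p in exclude_patterns)

print(is_excluded("_tmp_orders"))      # True
print(is_excluded("orders_cleansed"))  # False
```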
Row Counts
Setting `include_row_counts: true` runs `COUNT(*)` on every discovered table. This provides useful metadata but can be slow on large tables. Leave it `false` for initial scans, then enable selectively.
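For reference, a minimal config fragment; this sketch assumes the flag sits under the top-level `scanner:` key alongside the settings shown earlier:

```yaml
scanner:
  include_row_counts: false   # flip to true once the scan scope is narrowed
```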
Running the Catalog Scan
Via Notebook
Open `notebooks/scan_catalog.py` and configure the widgets:
| Widget | Default | Description |
|---|---|---|
| Config Path | `configs/catalog_config.yaml` | Scanner + enrichment config |
| Rules Path | `configs/classification_rules.yaml` | Column classification rules |
| Output Directory | `/Workspace/docs/catalog` | Where Markdown docs are written |
Run all cells. The notebook executes the full pipeline:
1. Scan — discovers tables and columns
2. Enrich — applies tags and schema descriptions
3. Classify — labels PII/sensitive columns
4. Score — computes quality scores per table
5. Index — builds the searchable catalog
6. Report — generates Markdown documentation
Programmatically
```python
from src.catalog_scanner import CatalogScanner
from src.metadata_enricher import MetadataEnricher
from src.quality_scorer import QualityScorer

scanner = CatalogScanner(config_path="configs/catalog_config.yaml")
snapshot = scanner.scan_all()

enricher = MetadataEnricher(config_path="configs/catalog_config.yaml")
enricher.apply_tags(snapshot)
enricher.classify_columns(snapshot)

scorer = QualityScorer()
scores = scorer.score_catalog(snapshot)
```
Metadata Enrichment
The `MetadataEnricher` adds business context to discovered metadata:
Tag Rules
Tags are assigned to tables based on their schema name:
```yaml
enrichment:
  tag_rules:
    bronze: ["raw", "ingestion", "append-only"]
    silver: ["cleansed", "validated", "deduplicated"]
    gold: ["curated", "business-ready", "aggregated"]
```
The enricher also infers medallion layer tags (bronze-layer, silver-layer, gold-layer) automatically from schema names.
Schema Descriptions
Tables without comments receive a default description based on their schema:
```yaml
enrichment:
  schema_descriptions:
    bronze: "Raw ingested data"
    silver: "Cleansed and validated data"
    gold: "Business-ready curated data"
```
Default Owner
Tables without an explicit owner are assigned `default_owner` (default: `"data-engineering"`).
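In YAML, assuming the key sits under `enrichment:` like the other settings in this section:

```yaml
enrichment:
  default_owner: "data-engineering"
```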
Column Classification
Column classification identifies sensitive data using pattern-matching rules defined in `configs/classification_rules.yaml`.
Rule Structure
```yaml
rules:
  - name: "email_address"
    classification: "PII"
    column_patterns:
      - "email"
      - "e_mail"
      - "email_address"
    description: "Email address fields"
```
Each rule specifies:

- `classification`: Label applied to matching columns (`PII`, `SENSITIVE`, `CONFIDENTIAL`)
- `column_patterns`: Regex patterns matched against column names (case-insensitive)
- `type_patterns`: Optional regex patterns matched against data types
The first matching rule wins. Order your rules from most specific to most general.
Built-in Rule Categories
| Classification | Examples |
|---|---|
| PII | email, phone, SSN, address, date_of_birth |
| SENSITIVE | salary, credit_card, bank_account, password |
| CONFIDENTIAL | api_key, secret, token |
Quality Scoring
Every table receives a 0–100 quality score based on four weighted dimensions:
Dimensions
| Dimension | Weight | What It Measures |
|---|---|---|
| Documentation | 30% | Table comment (40pts) + column comment coverage (60pts) |
| Completeness | 30% | Average non-null rate across columns |
| Freshness | 20% | Time since last update vs. threshold (default 24h) |
| Schema | 20% | snake_case naming (20pts) + no generics (20pts) + type diversity (60pts) |
Configuration
```yaml
quality:
  weights:
    documentation: 0.30
    completeness: 0.30
    freshness: 0.20
    schema: 0.20
  freshness_threshold_hours: 24
```
Weights must sum to 1.0. Adjust to match your organisation's priorities. For example, if documentation is your top concern, increase its weight.
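The roll-up arithmetic can be sketched as follows (assumed: each dimension is scored 0–100 and the total is the weighted sum; the example dimension scores are made up):

```python
# Weights from the config above; must sum to 1.0.
weights = {"documentation": 0.30, "completeness": 0.30,
           "freshness": 0.20, "schema": 0.20}

# Hypothetical per-dimension scores (0-100) for one table.
dimension_scores = {"documentation": 40, "completeness": 90,
                    "freshness": 100, "schema": 60}

assert abs(sum(weights.values()) - 1.0) < 1e-9

# Weighted sum: 0.3*40 + 0.3*90 + 0.2*100 + 0.2*60
total = sum(weights[d] * dimension_scores[d] for d in weights)
print(round(total, 1))  # 71.0
```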
Interpreting Scores
| Range | Label | Action |
|---|---|---|
| 80–100 | Good | Meets standards |
| 50–79 | Fair | Needs improvement |
| 0–49 | Poor | Requires immediate attention |
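The thresholds above translate to a simple bucketing function, sketched here for reference:

```python
def score_label(score: float) -> str:
    """Map a 0-100 quality score to its bucket, per the table above."""
    if score >= 80:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Poor"

print(score_label(85), score_label(71), score_label(30))  # Good Fair Poor
```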
Report Generation
The `CatalogReporter` generates per-table Markdown files using a Jinja2 template.
Template
The default template at `configs/templates/table_doc.md.j2` renders:
- Table name, type, owner, and description
- Column table with name, type, nullable, comment, and classification
- Quality score breakdown
- Generation timestamp
Customise the template to match your organisation's documentation standards.
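To illustrate, a minimal per-table template covering those fields might look like the sketch below. This is not the shipped `table_doc.md.j2`; variable names such as `table.full_name`, `col.classification`, and `generated_at` are assumptions.

```jinja
# {{ table.full_name }}

**Type:** {{ table.table_type }} · **Owner:** {{ table.owner }}

{{ table.comment or "_No description provided._" }}

| Column | Type | Nullable | Comment | Classification |
|--------|------|----------|---------|----------------|
{% for col in table.columns -%}
| {{ col.name }} | {{ col.data_type }} | {{ col.nullable }} | {{ col.comment or "" }} | {{ col.classification or "" }} |
{% endfor %}

_Generated {{ generated_at }} · Quality score: {{ score.total }}/100_
```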
Output
Generated files are written to the configured `output_dir` (default: `/Workspace/docs/catalog`):

```
/Workspace/docs/catalog/
├── index.md              # Catalog index with all tables
├── raw_customers.md      # Per-table documentation
├── orders_cleansed.md
└── customer_summary.md
```
Search Index
The `SearchIndex` provides fast keyword search across all catalog metadata.
What's Indexed
- Table names
- Table descriptions (comments)
- Tags
- Column names
- Column descriptions
- Column classifications
Search API
```python
from src.search_index import SearchIndex

index = SearchIndex()
index.build(snapshot)

# Multi-word search (all terms must match)
results = index.search("customer email")
for r in results:
    print(f"{r.table_full_name}.{r.column_name} [{r.match_field}] score={r.score}")
```
Relevance Scoring
Results are ranked by:
- Exact match: 10 points
- Starts-with: 5 points
- Contains: 1 point
Multiplied by field weights: `table_name` (3x), `tag` (2.5x), `column_name` (2x), `classification` (2x), descriptions (1x).
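Combining the two tables gives a plausible sketch of the ranking arithmetic (assumed implementation; the actual `SearchIndex` internals are not shown in this guide):

```python
# Per-field multipliers from the table above.
FIELD_WEIGHTS = {"table_name": 3.0, "tag": 2.5, "column_name": 2.0,
                 "classification": 2.0, "description": 1.0}

def match_score(term: str, value: str, field: str) -> float:
    """Base match score (exact/starts-with/contains) times the field weight."""
    term, value = term.lower(), value.lower()
    if value == term:
        base = 10
    elif value.startswith(term):
        base = 5
    elif term in value:
        base = 1
    else:
        return 0.0
    return base * FIELD_WEIGHTS[field]

print(match_score("customer", "customers", "table_name"))  # 15.0 (starts-with x 3)
print(match_score("email", "email", "column_name"))        # 20.0 (exact x 2)
```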
Dashboard Usage
Open `notebooks/catalog_dashboard.py` for an interactive overview:
- Tables by Schema — bar chart of table counts per schema
- Quality Distribution — per-table scores and bucket breakdown (Good/Fair/Poor)
- Undocumented Tables — tables missing descriptions
- PII Columns — classified columns grouped by classification type
- Search — interactive search via the `search_query` widget
- Summary — key metrics in a single table
The dashboard re-scans the catalog each time it runs. For large catalogs, consider caching the snapshot to a Delta table and reading from there.
Troubleshooting
"Table or view not found" errors
- Ensure the cluster has Unity Catalog enabled
- Verify you have `USE CATALOG` and `USE SCHEMA` permissions
- Check that the catalog/schema names in `catalog_config.yaml` are correct
Lineage queries fail
- `system.access.table_lineage` requires a Premium or Enterprise workspace
- The system tables may take up to 24 hours to populate after enabling
- Verify with: `SELECT * FROM system.access.table_lineage LIMIT 5`
Slow scans
- Disable `include_row_counts` if enabled
- Narrow `schemas` to specific targets instead of scanning all
- Add exclusion patterns for large staging/temp tables
- Use a single-node cluster (metadata queries don't benefit from parallelism)
Classification false positives
- Make column patterns more specific (e.g. `^email$` instead of `email`)
- Order rules from most specific to most general
- Add negative patterns or post-process to remove false matches
Quality scores seem wrong
- Completeness = 50: The scorer couldn't query the table (permission or connectivity issue)
- Freshness = 50: No `updated_at` or `created_at` timestamp available
- Check dimension details via `score.get_dimension("documentation").details`
By Datanest Digital | Version 1.0.0