Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Data Catalog Builder — Guide

Automated metadata discovery, classification, quality scoring, and documentation for Databricks Unity Catalog.


Table of Contents

  1. Prerequisites
  2. Scanner Configuration
  3. Running the Catalog Scan
  4. Metadata Enrichment
  5. Column Classification
  6. Quality Scoring
  7. Report Generation
  8. Search Index
  9. Dashboard Usage
  10. Troubleshooting

Prerequisites

Before using Data Catalog Builder, ensure:

  • Databricks Runtime 13.3+ with Unity Catalog enabled
  • Cluster access to the catalogs you want to scan
  • Python packages: pyyaml, jinja2 (both included in DBR by default)
  • Permissions: USE CATALOG, USE SCHEMA, and SELECT on information_schema / target tables
  • For lineage features: access to system.access.table_lineage and system.access.column_lineage

Recommended Cluster Config

| Setting | Value |
|---|---|
| Runtime | 13.3 LTS or 14.x |
| Node type | Standard_DS3_v2 (or equivalent) |
| Workers | 1–2 (metadata-only workload) |
| Unity Catalog | Enabled |

Scanner Configuration

The scanner is driven by configs/catalog_config.yaml. Key settings:

Catalogs and Schemas

```yaml
scanner:
  catalogs:
    - "main"
  schemas:
    - "bronze"
    - "silver"
    - "gold"
```

Leave schemas empty to scan all schemas in each catalog. This is useful for initial discovery but may be slow on catalogs with hundreds of schemas.

Exclusion Patterns

Filter out temporary and internal tables using glob-style patterns:

```yaml
scanner:
  exclude_patterns:
    - "_tmp_*"
    - "_staging_*"
    - "__apply_changes*"
    - "_checkpoint_*"
```

Patterns are matched against the table name only (not the full catalog.schema.table path). The * wildcard matches any sequence of characters.
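The matching behaviour described above can be sketched with Python's standard `fnmatch` module, which implements the same glob semantics (`*` matching any sequence of characters). The `is_excluded` helper here is illustrative, not the scanner's actual internal function.

```python
from fnmatch import fnmatch

# Illustrative sketch: glob patterns are matched against the bare
# table name, not the full catalog.schema.table path.
def is_excluded(table_name: str, patterns: list[str]) -> bool:
    return any(fnmatch(table_name, p) for p in patterns)

patterns = ["_tmp_*", "_staging_*", "__apply_changes*", "_checkpoint_*"]
print(is_excluded("_tmp_orders", patterns))   # True
print(is_excluded("orders", patterns))        # False
```

Note that because patterns are not anchored with an implicit `*` suffix, `_tmp_*` matches `_tmp_orders` but a pattern like `_tmp` alone would not.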

Row Counts

Setting include_row_counts: true runs COUNT(*) on every discovered table. This provides useful metadata but can be slow on large tables. Leave it false for initial scans, then enable selectively.


Running the Catalog Scan

Via Notebook

Open notebooks/scan_catalog.py and configure the widgets:

| Widget | Default | Description |
|---|---|---|
| Config Path | configs/catalog_config.yaml | Scanner + enrichment config |
| Rules Path | configs/classification_rules.yaml | Column classification rules |
| Output Directory | /Workspace/docs/catalog | Where Markdown docs are written |

Run all cells. The notebook executes the full pipeline:

  1. Scan — discovers tables and columns
  2. Enrich — applies tags and schema descriptions
  3. Classify — labels PII/sensitive columns
  4. Score — computes quality scores per table
  5. Index — builds the searchable catalog
  6. Report — generates Markdown documentation

Programmatically

```python
from src.catalog_scanner import CatalogScanner
from src.metadata_enricher import MetadataEnricher
from src.quality_scorer import QualityScorer

# Discover tables and columns across the configured catalogs
scanner = CatalogScanner(config_path="configs/catalog_config.yaml")
snapshot = scanner.scan_all()

# Apply tag rules and classify sensitive columns in place
enricher = MetadataEnricher(config_path="configs/catalog_config.yaml")
enricher.apply_tags(snapshot)
enricher.classify_columns(snapshot)

# Compute per-table quality scores from the enriched snapshot
scorer = QualityScorer()
scores = scorer.score_catalog(snapshot)
```

Metadata Enrichment

The MetadataEnricher adds business context to discovered metadata:

Tag Rules

Tags are assigned to tables based on their schema name:

```yaml
enrichment:
  tag_rules:
    bronze: ["raw", "ingestion", "append-only"]
    silver: ["cleansed", "validated", "deduplicated"]
    gold:   ["curated", "business-ready", "aggregated"]
```

The enricher also infers medallion layer tags (bronze-layer, silver-layer, gold-layer) automatically from schema names.

Schema Descriptions

Tables without comments receive a default description based on their schema:

```yaml
enrichment:
  schema_descriptions:
    bronze: "Raw ingested data"
    silver: "Cleansed and validated data"
    gold:   "Business-ready curated data"
```

Default Owner

Tables without an explicit owner are assigned default_owner (default: "data-engineering").


Column Classification

Column classification identifies sensitive data using pattern-matching rules defined in configs/classification_rules.yaml.

Rule Structure

```yaml
rules:
  - name: "email_address"
    classification: "PII"
    column_patterns:
      - "email"
      - "e_mail"
      - "email_address"
    description: "Email address fields"
```

Each rule specifies:

  • classification: Label applied to matching columns (PII, SENSITIVE, CONFIDENTIAL)
  • column_patterns: Regex patterns matched against column names (case-insensitive)
  • type_patterns: Optional regex patterns matched against data types

The first matching rule wins. Order your rules from most specific to most general.

Built-in Rule Categories

| Classification | Examples |
|---|---|
| PII | email, phone, SSN, address, date_of_birth |
| SENSITIVE | salary, credit_card, bank_account, password |
| CONFIDENTIAL | api_key, secret, token |

Quality Scoring

Every table receives a 0–100 quality score based on four weighted dimensions:

Dimensions

| Dimension | Weight | What It Measures |
|---|---|---|
| Documentation | 30% | Table comment (40 pts) + column comment coverage (60 pts) |
| Completeness | 30% | Average non-null rate across columns |
| Freshness | 20% | Time since last update vs. threshold (default 24 h) |
| Schema | 20% | snake_case naming (20 pts) + no generic names (20 pts) + type diversity (60 pts) |

Configuration

```yaml
quality:
  weights:
    documentation: 0.30
    completeness: 0.30
    freshness: 0.20
    schema: 0.20
  freshness_threshold_hours: 24
```

Weights must sum to 1.0. Adjust to match your organisation's priorities. For example, if documentation is your top concern, increase its weight.
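The overall score is then a weighted sum of the four dimension scores. This sketch assumes each dimension is already on a 0–100 scale and simply applies the configured weights; `overall_score` is illustrative.

```python
# Weights mirror the YAML config above and must sum to 1.0.
weights = {"documentation": 0.30, "completeness": 0.30,
           "freshness": 0.20, "schema": 0.20}

def overall_score(dimensions: dict[str, float]) -> float:
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    return sum(weights[name] * dimensions[name] for name in weights)

# 0.3*80 + 0.3*90 + 0.2*100 + 0.2*70 = 24 + 27 + 20 + 14 = 85
print(overall_score({"documentation": 80, "completeness": 90,
                     "freshness": 100, "schema": 70}))
```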

Interpreting Scores

| Range | Label | Action |
|---|---|---|
| 80–100 | Good | Meets standards |
| 50–79 | Fair | Needs improvement |
| 0–49 | Poor | Requires immediate attention |

Report Generation

The CatalogReporter generates per-table Markdown files using a Jinja2 template.

Template

The default template at configs/templates/table_doc.md.j2 renders:

  • Table name, type, owner, and description
  • Column table with name, type, nullable, comment, and classification
  • Quality score breakdown
  • Generation timestamp

Customise the template to match your organisation's documentation standards.

Output

Generated files are written to the configured output_dir (default: /Workspace/docs/catalog):

```
/Workspace/docs/catalog/
├── index.md              # Catalog index with all tables
├── raw_customers.md      # Per-table documentation
├── orders_cleansed.md
└── customer_summary.md
```

Search Index

The SearchIndex provides fast keyword search across all catalog metadata.

What's Indexed

  • Table names
  • Table descriptions (comments)
  • Tags
  • Column names
  • Column descriptions
  • Column classifications

Search API

```python
from src.search_index import SearchIndex

index = SearchIndex()
index.build(snapshot)

# Multi-word search (all terms must match)
results = index.search("customer email")

for r in results:
    print(f"{r.table_full_name}.{r.column_name} [{r.match_field}] score={r.score}")
```

Relevance Scoring

Results are ranked by a base match score:

  1. Exact match: 10 points
  2. Starts-with: 5 points
  3. Contains: 1 point

The base score is then multiplied by a field weight: table_name (3×), tag (2.5×), column_name (2×), classification (2×), descriptions (1×).
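The ranking rule can be sketched as a single scoring function: pick the base score for the match type, then apply the field weight. `match_score` and `FIELD_WEIGHTS` are illustrative names, not the `SearchIndex` internals.

```python
# Sketch of the relevance rule: base match score times field weight.
FIELD_WEIGHTS = {"table_name": 3.0, "tag": 2.5, "column_name": 2.0,
                 "classification": 2.0, "description": 1.0}

def match_score(term: str, value: str, field: str) -> float:
    term, value = term.lower(), value.lower()
    if value == term:
        base = 10          # exact match
    elif value.startswith(term):
        base = 5           # starts-with
    elif term in value:
        base = 1           # contains
    else:
        return 0.0         # no match
    return base * FIELD_WEIGHTS[field]

# "customers" starts with "customer" in a table name: 5 * 3.0 = 15.0
print(match_score("customer", "customers", "table_name"))
```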


Dashboard Usage

Open notebooks/catalog_dashboard.py for an interactive overview:

  1. Tables by Schema — bar chart of table counts per schema
  2. Quality Distribution — per-table scores and bucket breakdown (Good/Fair/Poor)
  3. Undocumented Tables — tables missing descriptions
  4. PII Columns — classified columns grouped by classification type
  5. Search — interactive search via the search_query widget
  6. Summary — key metrics in a single table

The dashboard re-scans the catalog each time it runs. For large catalogs, consider caching the snapshot to a Delta table and reading from there.


Troubleshooting

"Table or view not found" errors

  • Ensure the cluster has Unity Catalog enabled
  • Verify you have USE CATALOG and USE SCHEMA permissions
  • Check that the catalog/schema names in catalog_config.yaml are correct

Lineage queries fail

  • system.access.table_lineage requires a Premium or Enterprise workspace
  • The system tables may take up to 24 hours to populate after enabling
  • Verify with: SELECT * FROM system.access.table_lineage LIMIT 5

Slow scans

  • Disable include_row_counts if enabled
  • Narrow schemas to specific targets instead of scanning all
  • Add exclusion patterns for large staging/temp tables
  • Use a single-node cluster (metadata queries don't benefit from parallelism)

Classification false positives

  • Make column patterns more specific (e.g. ^email$ instead of email)
  • Order rules from most specific to most general
  • Add negative patterns or post-process to remove false matches

Quality scores seem wrong

  • Completeness = 50: The scorer couldn't query the table (permission or connectivity issue)
  • Freshness = 50: No updated_at or created_at timestamp available
  • Check dimension details via score.get_dimension("documentation").details

By Datanest Digital | Version 1.0.0