Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Data Catalog Builder — Guide

Automated metadata discovery, classification, quality scoring, and documentation for Databricks Unity Catalog.


Table of Contents

  1. Prerequisites
  2. Scanner Configuration
  3. Running the Catalog Scan
  4. Metadata Enrichment
  5. Column Classification
  6. Quality Scoring
  7. Report Generation
  8. Search Index
  9. Dashboard Usage
  10. Troubleshooting

Prerequisites

Before using Data Catalog Builder, ensure:

  • Databricks Runtime 13.3+ with Unity Catalog enabled
  • Cluster access to the catalogs you want to scan
  • Python packages: pyyaml, jinja2 (both included in DBR by default)
  • Permissions: USE CATALOG, USE SCHEMA, and SELECT on information_schema / target tables
  • For lineage features: access to system.access.table_lineage and system.access.column_lineage

Recommended Cluster Config

| Setting | Value |
|---|---|
| Runtime | 13.3 LTS or 14.x |
| Node type | Standard_DS3_v2 (or equivalent) |
| Workers | 1–2 (metadata-only workload) |
| Unity Catalog | Enabled |

Scanner Configuration

The scanner is driven by configs/catalog_config.yaml. Key settings:

Catalogs and Schemas

```yaml
scanner:
  catalogs:
    - "main"
  schemas:
    - "bronze"
    - "silver"
    - "gold"
```

Leave schemas empty to scan all schemas in each catalog. This is useful for initial discovery but may be slow on catalogs with hundreds of schemas.

Exclusion Patterns

Filter out temporary and internal tables using glob-style patterns:

```yaml
scanner:
  exclude_patterns:
    - "_tmp_*"
    - "_staging_*"
    - "__apply_changes*"
    - "_checkpoint_*"
```

Patterns are matched against the table name only (not the full catalog.schema.table path). The * wildcard matches any sequence of characters.
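The matching behaviour described above can be sketched with Python's standard `fnmatch` module, which implements the same glob semantics (`*` matching any sequence of characters). The `is_excluded` helper here is illustrative, not the scanner's actual internal function.

```python
from fnmatch import fnmatch

# Illustrative sketch: glob patterns are matched against the bare
# table name, not the full catalog.schema.table path.
def is_excluded(table_name: str, patterns: list[str]) -> bool:
    return any(fnmatch(table_name, p) for p in patterns)

patterns = ["_tmp_*", "_staging_*", "__apply_changes*", "_checkpoint_*"]
print(is_excluded("_tmp_orders", patterns))   # True
print(is_excluded("orders", patterns))        # False
```

Note that because patterns are not anchored with an implicit `*` suffix, `_tmp_*` matches `_tmp_orders` but a pattern like `_tmp` alone would not.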

Row Counts

Setting include_row_counts: true runs COUNT(*) on every discovered table. This provides useful metadata but can be slow on large tables. Leave it false for initial scans, then enable selectively.


Running the Catalog Scan

Via Notebook

Open notebooks/scan_catalog.py and configure the widgets:

| Widget | Default | Description |
|---|---|---|
| Config Path | configs/catalog_config.yaml | Scanner + enrichment config |
| Rules Path | configs/classification_rules.yaml | Column classification rules |
| Output Directory | /Workspace/docs/catalog | Where Markdown docs are written |

Run all cells. The notebook executes the full pipeline:

  1. Scan — discovers tables and columns
  2. Enrich — applies tags and schema descriptions
  3. Classify — labels PII/sensitive columns
  4. Score — computes quality scores per table
  5. Index — builds the searchable catalog
  6. Report — generates Markdown documentation

Programmatically

```python
from src.catalog_scanner import CatalogScanner
from src.metadata_enricher import MetadataEnricher
from src.quality_scorer import QualityScorer

# Discover tables and columns across the configured catalogs
scanner = CatalogScanner(config_path="configs/catalog_config.yaml")
snapshot = scanner.scan_all()

# Apply tag rules and classify sensitive columns in place
enricher = MetadataEnricher(config_path="configs/catalog_config.yaml")
enricher.apply_tags(snapshot)
enricher.classify_columns(snapshot)

# Compute per-table quality scores from the enriched snapshot
scorer = QualityScorer()
scores = scorer.score_catalog(snapshot)
```

Metadata Enrichment

The MetadataEnricher adds business context to discovered metadata:

Tag Rules

Tags are assigned to tables based on their schema name:

```yaml
enrichment:
  tag_rules:
    bronze: ["raw", "ingestion", "append-only"]
    silver: ["cleansed", "validated", "deduplicated"]
    gold:   ["curated", "business-ready", "aggregated"]
```

The enricher also infers medallion layer tags (bronze-layer, silver-layer, gold-layer) automatically from schema names.

Schema Descriptions

Tables without comments receive a default description based on their schema:

```yaml
enrichment:
  schema_descriptions:
    bronze: "Raw ingested data"
    silver: "Cleansed and validated data"
    gold:   "Business-ready curated data"
```

Default Owner

Tables without an explicit owner are assigned default_owner (default: "data-engineering").


Column Classification

Column classification identifies sensitive data using pattern-matching rules defined in configs/classification_rules.yaml.

Rule Structure

```yaml
rules:
  - name: "email_address"
    classification: "PII"
    column_patterns:
      - "email"
      - "e_mail"
      - "email_address"
    description: "Email address fields"
```

Each rule specifies:

  • classification: Label applied to matching columns (PII, SENSITIVE, CONFIDENTIAL)
  • column_patterns: Regex patterns matched against column names (case-insensitive)
  • type_patterns: Optional regex patterns matched against data types

The first matching rule wins. Order your rules from most specific to most general.

Built-in Rule Categories

| Classification | Examples |
|---|---|
| PII | email, phone, SSN, address, date_of_birth |
| SENSITIVE | salary, credit_card, bank_account, password |
| CONFIDENTIAL | api_key, secret, token |

Quality Scoring

Every table receives a 0–100 quality score based on four weighted dimensions:

Dimensions

| Dimension | Weight | What It Measures |
|---|---|---|
| Documentation | 30% | Table comment (40 pts) + column comment coverage (60 pts) |
| Completeness | 30% | Average non-null rate across columns |
| Freshness | 20% | Time since last update vs. threshold (default 24 h) |
| Schema | 20% | snake_case naming (20 pts) + no generic names (20 pts) + type diversity (60 pts) |

Configuration

```yaml
quality:
  weights:
    documentation: 0.30
    completeness: 0.30
    freshness: 0.20
    schema: 0.20
  freshness_threshold_hours: 24
```

Weights must sum to 1.0. Adjust to match your organisation's priorities. For example, if documentation is your top concern, increase its weight.
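The overall score is then a weighted sum of the four dimension scores. This sketch assumes each dimension is already on a 0–100 scale and simply applies the configured weights; `overall_score` is illustrative.

```python
# Weights mirror the YAML config above and must sum to 1.0.
weights = {"documentation": 0.30, "completeness": 0.30,
           "freshness": 0.20, "schema": 0.20}

def overall_score(dimensions: dict[str, float]) -> float:
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    return sum(weights[name] * dimensions[name] for name in weights)

# 0.3*80 + 0.3*90 + 0.2*100 + 0.2*70 = 24 + 27 + 20 + 14 = 85
print(overall_score({"documentation": 80, "completeness": 90,
                     "freshness": 100, "schema": 70}))
```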

Interpreting Scores

| Range | Label | Action |
|---|---|---|
| 80–100 | Good | Meets standards |
| 50–79 | Fair | Needs improvement |
| 0–49 | Poor | Requires immediate attention |

Report Generation

The CatalogReporter generates per-table Markdown files using a Jinja2 template.

Template

The default template at configs/templates/table_doc.md.j2 renders:

  • Table name, type, owner, and description
  • Column table with name, type, nullable, comment, and classification
  • Quality score breakdown
  • Generation timestamp

Customise the template to match your organisation's documentation standards.

Output

Generated files are written to the configured output_dir (default: /Workspace/docs/catalog):

```
/Workspace/docs/catalog/
├── index.md              # Catalog index with all tables
├── raw_customers.md      # Per-table documentation
├── orders_cleansed.md
└── customer_summary.md
```

Search Index

The SearchIndex provides fast keyword search across all catalog metadata.

What's Indexed

  • Table names
  • Table descriptions (comments)
  • Tags
  • Column names
  • Column descriptions
  • Column classifications

Search API

```python
from src.search_index import SearchIndex

index = SearchIndex()
index.build(snapshot)

# Multi-word search (all terms must match)
results = index.search("customer email")

for r in results:
    print(f"{r.table_full_name}.{r.column_name} [{r.match_field}] score={r.score}")
```

Relevance Scoring

Results are ranked by a base match score:

  1. Exact match: 10 points
  2. Starts-with: 5 points
  3. Contains: 1 point

The base score is then multiplied by a field weight: table_name (3×), tag (2.5×), column_name (2×), classification (2×), descriptions (1×).
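The ranking rule can be sketched as a single scoring function: pick the base score for the match type, then apply the field weight. `match_score` and `FIELD_WEIGHTS` are illustrative names, not the `SearchIndex` internals.

```python
# Sketch of the relevance rule: base match score times field weight.
FIELD_WEIGHTS = {"table_name": 3.0, "tag": 2.5, "column_name": 2.0,
                 "classification": 2.0, "description": 1.0}

def match_score(term: str, value: str, field: str) -> float:
    term, value = term.lower(), value.lower()
    if value == term:
        base = 10          # exact match
    elif value.startswith(term):
        base = 5           # starts-with
    elif term in value:
        base = 1           # contains
    else:
        return 0.0         # no match
    return base * FIELD_WEIGHTS[field]

# "customers" starts with "customer" in a table name: 5 * 3.0 = 15.0
print(match_score("customer", "customers", "table_name"))
```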


Dashboard Usage

Open notebooks/catalog_dashboard.py for an interactive overview:

  1. Tables by Schema — bar chart of table counts per schema
  2. Quality Distribution — per-table scores and bucket breakdown (Good/Fair/Poor)
  3. Undocumented Tables — tables missing descriptions
  4. PII Columns — classified columns grouped by classification type
  5. Search — interactive search via the search_query widget
  6. Summary — key metrics in a single table

The dashboard re-scans the catalog each time it runs. For large catalogs, consider caching the snapshot to a Delta table and reading from there.


Troubleshooting

"Table or view not found" errors

  • Ensure the cluster has Unity Catalog enabled
  • Verify you have USE CATALOG and USE SCHEMA permissions
  • Check that the catalog/schema names in catalog_config.yaml are correct

Lineage queries fail

  • system.access.table_lineage requires a Premium or Enterprise workspace
  • The system tables may take up to 24 hours to populate after enabling
  • Verify with: SELECT * FROM system.access.table_lineage LIMIT 5

Slow scans

  • Disable include_row_counts if enabled
  • Narrow schemas to specific targets instead of scanning all
  • Add exclusion patterns for large staging/temp tables
  • Use a single-node cluster (metadata queries don't benefit from parallelism)

Classification false positives

  • Make column patterns more specific (e.g. ^email$ instead of email)
  • Order rules from most specific to most general
  • Add negative patterns or post-process to remove false matches

Quality scores seem wrong

  • Completeness = 50: The scorer couldn't query the table (permission or connectivity issue)
  • Freshness = 50: No updated_at or created_at timestamp available
  • Check dimension details via score.get_dimension("documentation").details

By Datanest Digital | Version 1.0.0