
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Databricks Starter Kit

Production-ready templates for building data platforms on Databricks with Unity Catalog and Delta Lake.

Skip the months of trial and error. This kit gives you the same patterns and architecture used by data platform teams at scale — fully documented, customizable, and ready to deploy.


What's Inside

| Module | What It Does |
| --- | --- |
| config/ | Environment detection, secret management, structured logging |
| medallion_bootstrap/ | One-command setup of your bronze/silver/gold catalog structure with RBAC |
| ingestion_templates/ | Battle-tested pipelines for APIs, databases, files, and streaming |
| cicd_templates/ | Azure DevOps & GitHub Actions pipelines, deployment scripts, test runner |
| unity_catalog_setup/ | SQL scripts for catalogs, external locations, credentials, and governance |

21 files — every one fully runnable, type-hinted, and documented.


Quick Start

1. Upload to Databricks

Upload this kit to a Databricks Repo or the workspace file system:

/Repos/<your-user>/databricks-starter-kit/

Or use the Databricks CLI:

databricks repos create \
  --url https://github.com/your-org/databricks-starter-kit \
  --provider github

2. Configure Your Environment

Edit config/environment.py to register your workspace IDs:

WORKSPACE_ENV_MAP: dict[str, str] = {
    "1234567890123456": "dev",      # Your dev workspace org ID
    "2345678901234567": "staging",  # Your staging workspace
    "3456789012345678": "prod",     # Your production workspace
}

Update storage account names and catalog prefixes to match your naming conventions.

3. Bootstrap the Medallion Architecture

Edit medallion_bootstrap/config.py with your catalog names and team groups, then run:

# In a Databricks notebook — run these in order:
%run ./medallion_bootstrap/01_create_catalogs
%run ./medallion_bootstrap/02_create_schemas
%run ./medallion_bootstrap/03_grant_permissions

This creates your full bronze → silver → gold catalog structure with proper RBAC in minutes.

4. Run Your First Ingestion

from ingestion_templates.api_ingestion import ApiIngestionPipeline, ApiSourceConfig

pipeline = ApiIngestionPipeline(
    pipeline_name="demo_api_load",
    source_system="demo_api",
    source_class="users",
    api_config=ApiSourceConfig(
        base_url="https://jsonplaceholder.typicode.com",
        endpoint="/users",
    ),
)
pipeline.run()

File Structure

databricks-starter-kit/
│
├── README.md                          # This file
├── LICENSE                            # MIT License
│
├── config/
│   ├── environment.py                 # Environment detection & workspace config
│   ├── secrets.py                     # Secret management (Key Vault, scopes, env vars)
│   └── logging_config.py              # Structured logging with Delta table sink
│
├── medallion_bootstrap/
│   ├── config.py                      # Catalog, schema, and permission definitions
│   ├── 01_create_catalogs.py          # Create bronze/silver/gold Unity Catalogs
│   ├── 02_create_schemas.py           # Create schemas within each catalog
│   └── 03_grant_permissions.py        # Grant RBAC permissions to groups
│
├── ingestion_templates/
│   ├── base_pipeline.py               # Abstract base class — merge, append, SCD2, dedup
│   ├── api_ingestion.py               # REST API ingestion with pagination & retry
│   ├── database_ingestion.py          # JDBC ingestion (SQL Server, PostgreSQL, Oracle, MySQL)
│   ├── file_ingestion.py              # Auto Loader (cloudFiles) for CSV/JSON/Parquet/Avro/XML
│   └── streaming_ingestion.py         # Event Hub & Kafka streaming with checkpoints
│
├── cicd_templates/
│   ├── azure_devops_pipeline.yml      # Multi-stage Azure DevOps pipeline
│   ├── github_actions_workflow.yml    # GitHub Actions with OIDC & environment protection
│   ├── deploy_notebooks.py            # Deploy workflows & notebooks via REST API
│   └── run_tests.py                   # Test runner — tag validation, structure checks, secrets scan
│
└── unity_catalog_setup/
    ├── setup_catalogs.sql             # SQL to create catalogs with tags and properties
    ├── setup_external_locations.sql   # External locations for each data layer
    ├── setup_credentials.sql          # Storage credentials (Azure MI, SP, AWS IAM, GCP SA)
    └── data_governance_policies.md    # Governance playbook — RBAC, classification, DQ, audit

Module Deep Dives

config/ — Environment, Secrets & Logging

environment.py — Auto-detects which Databricks workspace you're running in and returns the correct environment config (dev/staging/prod). Uses the workspace org ID for detection, with a serverless-compatible fallback via workspace URL parsing.

from config.environment import get_environment

env = get_environment()
print(env.name)             # "dev"
print(env.catalog_prefix)   # "dev"
print(env.storage.raw)      # "abfss://raw@styourorgdev.dfs.core.windows.net"
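The URL-parsing fallback boils down to pulling the numeric org ID out of the workspace hostname. Here is a minimal sketch of that idea; the helper name and the lookup map are illustrative, not the kit's actual internals:

```python
import re

# Illustrative map, same shape as WORKSPACE_ENV_MAP above.
WORKSPACE_ENV_MAP: dict[str, str] = {
    "1234567890123456": "dev",
    "2345678901234567": "staging",
}

def env_from_workspace_url(url: str) -> str:
    """Fallback detection: Azure Databricks URLs look like
    https://adb-<org-id>.<n>.azuredatabricks.net, so the org ID
    is the run of digits after 'adb-'."""
    match = re.search(r"adb-(\d+)\.", url)
    if not match:
        raise ValueError(f"Could not parse an org ID from {url!r}")
    return WORKSPACE_ENV_MAP.get(match.group(1), "dev")
```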

secrets.py — Unified secret access with layered lookup: Databricks Secret Scope → Azure Key Vault → environment variables. Includes JDBC connection string builder and in-memory caching with TTL.

from config.secrets import SecretManager

secrets = SecretManager()
api_key = secrets.get("my-api-key")
jdbc_url = secrets.jdbc_url("my-database", driver="sqlserver")

logging_config.py — Structured logging with correlation IDs for tracing a pipeline run across stages. Optionally flushes logs to a Delta table for persistent audit trail and dashboards.

from config.logging_config import PipelineLogger

logger = PipelineLogger(pipeline_name="crm_sync", environment="prod")
logger.info("Starting extraction", extra={"table": "accounts"})
logger.log_metric("rows_processed", 152_000)
logger.flush_to_delta("audit.pipeline_logs")

medallion_bootstrap/ — Catalog Setup in Minutes

Creates your entire Unity Catalog structure — catalogs, schemas, and permissions — from a single configuration file.

What it creates:

| Layer | Catalog | Schemas (customizable) |
| --- | --- | --- |
| Bronze | {env}_bronze | raw, cdc, streaming |
| Silver | {env}_silver | cleansed, conformed, enriched |
| Gold | {env}_gold | analytics, reporting, features |

Permissions are preset-based — assign groups like data_readers, data_engineers, and data_admins with sensible defaults, or customize them at a granular level.

Edit medallion_bootstrap/config.py once, then the three scripts handle the rest idempotently (safe to re-run).
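Conceptually, the three scripts expand a small configuration into idempotent SQL. The sketch below illustrates the shape of that expansion; the kit's real config.py and scripts are richer (grants, tags, managed locations), so treat the names here as illustrative:

```python
# Layer -> schema names, mirroring the table above.
LAYERS: dict[str, list[str]] = {
    "bronze": ["raw", "cdc", "streaming"],
    "silver": ["cleansed", "conformed", "enriched"],
    "gold": ["analytics", "reporting", "features"],
}

def bootstrap_statements(env: str) -> list[str]:
    """Generate idempotent CREATE statements for one environment.
    IF NOT EXISTS is what makes re-running the scripts safe."""
    stmts: list[str] = []
    for layer, schemas in LAYERS.items():
        catalog = f"{env}_{layer}"
        stmts.append(f"CREATE CATALOG IF NOT EXISTS {catalog}")
        for schema in schemas:
            stmts.append(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
    return stmts

# In a notebook, each statement would be executed with spark.sql(stmt).
```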


ingestion_templates/ — Five Pipelines, One Base Class

All pipelines extend BasePipeline, which provides:

  • Write modes: append, merge (upsert), overwrite, scd2 (Slowly Changing Dimension Type 2)
  • Metadata columns: _etl_load_timestamp_utc, _source_system, _source_filename added automatically
  • Deduplication: Window-function-based dedup with configurable key and ordering columns
  • Column sanitizer: fix_column_names() replaces spaces, dots, and special characters
  • Delta table init: Auto-creates target tables with configurable properties
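The column sanitizer, for example, amounts to a regex pass over the column names. A minimal sketch, assuming the kit's fix_column_names() behaves roughly like this (details such as casing may differ):

```python
import re

def fix_column_names(columns: list[str]) -> list[str]:
    """Replace spaces, dots, and special characters so names are
    valid Delta column identifiers (sketch; details may differ)."""
    fixed = []
    for name in columns:
        clean = re.sub(r"[^0-9a-zA-Z_]+", "_", name)  # specials -> underscore
        clean = re.sub(r"_+", "_", clean).strip("_")  # collapse and trim
        fixed.append(clean.lower())
    return fixed
```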

API Ingestion (api_ingestion.py)

REST API ingestion with:

  • Pagination: offset, cursor, and link-header strategies
  • Retry with exponential backoff
  • Content hashing for change detection
  • Rate limiting
  • Incremental and full-refresh modes
pipeline = ApiIngestionPipeline(
    pipeline_name="salesforce_accounts",
    source_system="salesforce",
    source_class="accounts",
    api_config=ApiSourceConfig(
        base_url="https://your-instance.salesforce.com",
        endpoint="/services/data/v58.0/query",
        auth_type="bearer",
        pagination_type="cursor",
        cursor_field="nextRecordsUrl",
    ),
)
pipeline.run()
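Offset pagination with retry is the simplest of the strategies listed above. A self-contained sketch with an injected fetch function (the function and parameter names are illustrative, not the kit's internals):

```python
import time
from typing import Callable

def paginate_offset(
    fetch: Callable[[int, int], list[dict]],
    page_size: int = 100,
    max_retries: int = 3,
) -> list[dict]:
    """Pull pages via fetch(offset, limit) until a short or empty page,
    retrying each page with exponential backoff."""
    records: list[dict] = []
    offset = 0
    while True:
        for attempt in range(max_retries):
            try:
                page = fetch(offset, page_size)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s...
        records.extend(page)
        if len(page) < page_size:  # short page means we're done
            return records
        offset += page_size
```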

Database Ingestion (database_ingestion.py)

JDBC ingestion with:

  • Auto driver detection for SQL Server, PostgreSQL, Oracle, MySQL
  • Full and incremental load modes (watermark-based)
  • Parallel partitioned reads for large tables
  • String trimming and type coercion
pipeline = DatabaseIngestionPipeline(
    pipeline_name="erp_customers",
    source_system="erp",
    source_class="customers",
    jdbc_config=JdbcSourceConfig(
        host="erp-sql-server.database.windows.net",
        database="erp_prod",
        schema="dbo",
        table="Customers",
        driver_type="sqlserver",
        load_mode="incremental",
        watermark_column="ModifiedDate",
    ),
)
pipeline.run()
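Incremental mode rests on a watermark predicate pushed down to the source. A sketch of the idea (the kit's actual query construction may differ, and a production version should parameterize the value rather than interpolate it):

```python
def incremental_query(schema: str, table: str,
                      watermark_column: str, last_watermark: str) -> str:
    """Build a JDBC pushdown subquery that reads only rows changed
    since the last successful load."""
    return (
        f"(SELECT * FROM {schema}.{table} "
        f"WHERE {watermark_column} > '{last_watermark}') AS incremental_src"
    )

# Spark reads it via:
#   spark.read.format("jdbc").option("dbtable", incremental_query(...))
# After a successful load, MAX(watermark_column) of the batch
# becomes the stored watermark for the next run.
```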

File Ingestion (file_ingestion.py)

Cloud Files (Auto Loader) ingestion with:

  • Formats: CSV, JSON, Parquet, Avro, XML
  • Schema inference with evolution support
  • foreachBatch writer for merge/upsert targets
  • Trigger-once (batch) and continuous streaming modes
pipeline = FileIngestionPipeline(
    pipeline_name="landing_csvs",
    source_system="external_vendor",
    source_class="transactions",
    file_config=FileSourceConfig(
        source_path="abfss://landing@storage.dfs.core.windows.net/vendor/",
        file_format="csv",
        header=True,
        trigger_mode="trigger_once",
    ),
)
pipeline.run()

Streaming Ingestion (streaming_ingestion.py)

Event Hub and Kafka ingestion with:

  • Event Hub auth: OAuth (Managed Identity) or connection string
  • Kafka auth: SASL/SSL, plaintext, or OAuth
  • Checkpoint management for exactly-once semantics
  • JSON value parsing with schema
  • Configurable triggers and watermarks
from ingestion_templates.streaming_ingestion import (
    StreamingIngestionPipeline,
    EventHubConfig,
)

pipeline = StreamingIngestionPipeline(
    pipeline_name="clickstream_events",
    source_system="event_hub",
    source_class="clickstream",
    streaming_config=EventHubConfig(
        namespace="your-eventhub-namespace",
        topic="clickstream",
        consumer_group="databricks-consumer",
        auth_type="oauth",
    ),
)
pipeline.run()

cicd_templates/ — Ship With Confidence

Azure DevOps (azure_devops_pipeline.yml)

Multi-stage pipeline with:

  • Service principal authentication
  • Git-diff-based workflow deployment (only deploys what changed)
  • Test stage with ruff linting and pytest
  • Manual approval gate for production
  • Parameterized for dev/staging/prod

GitHub Actions (github_actions_workflow.yml)

Workflow with:

  • OIDC authentication (no stored secrets)
  • Environment protection rules
  • PR validation (lint + test)
  • Matrix deployment across environments

Deployment Script (deploy_notebooks.py)

CLI tool that:

  • Deploys workflow definitions from JSON files via the Jobs API 2.1
  • Updates Databricks Repos to a specific branch or tag
  • Imports individual notebooks via the Workspace API
  • Supports dry-run mode for previewing changes
# Deploy a workflow
python deploy_notebooks.py deploy-workflow \
  --workspace-url https://adb-123.azuredatabricks.net \
  --file workflows/my_pipeline.json

# Dry run
python deploy_notebooks.py deploy-workflow \
  --file workflows/my_pipeline.json \
  --dry-run

Test Runner (run_tests.py)

Validates your Databricks project:

  • Workflow tag validation (required tags, valid values)
  • JSON structure validation
  • Hardcoded secret detection (scans for API keys, passwords, tokens)
  • Python syntax checking
  • JUnit XML output for CI integration
python run_tests.py --workflow-dir workflows/ --output results.xml
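The secret scan is essentially a set of regexes run over source files. A minimal sketch of the idea (the patterns shown here are an assumption; the kit's actual list is longer and tuned):

```python
import re

# Illustrative patterns only; real scanners ship a much longer list.
SECRET_PATTERNS = [
    re.compile(r"""(?i)(password|passwd|api[_-]?key|token)\s*=\s*["'][^"']+["']"""),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID shape
]

def find_hardcoded_secrets(text: str) -> list[str]:
    """Return the offending snippets found in a source string."""
    hits: list[str] = []
    for pattern in SECRET_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits
```

Note that the first pattern only flags quoted literals, so code that pulls the value from a secret manager does not trip it.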

unity_catalog_setup/ — Governance From Day One

SQL scripts and a governance playbook for Unity Catalog:

  • setup_catalogs.sql — Creates bronze/silver/gold/audit catalogs with tags and default schemas
  • setup_external_locations.sql — External locations for each layer (raw, bronze, silver, gold, checkpoints, export)
  • setup_credentials.sql — Storage credential templates for Azure Managed Identity, Service Principal, AWS IAM Role, and GCP Service Account
  • data_governance_policies.md — Complete governance playbook covering:
    • RBAC model with group hierarchy
    • Data classification tiers (public, internal, confidential, restricted) with tagging SQL
    • Row-level and column-level security patterns
    • Data retention and VACUUM policies
    • Data quality constraints and quarantine patterns
    • Audit queries for tracking access
    • Compliance checklist

Customization Guide

Naming Conventions

Every naming pattern is configurable. Key places to update:

| What | Where | Default Pattern |
| --- | --- | --- |
| Catalog names | medallion_bootstrap/config.py | {env}_bronze, {env}_silver, {env}_gold |
| Schema names | medallion_bootstrap/config.py | raw, cleansed, analytics, etc. |
| Storage accounts | config/environment.py | styourorg{env} |
| Secret scope | config/secrets.py | pipeline_secrets |
| Log table | config/logging_config.py | audit.pipeline_logs |

Adding a New Ingestion Pipeline

  1. Create a new file in ingestion_templates/
  2. Extend BasePipeline:
from ingestion_templates.base_pipeline import BasePipeline

class MyCustomPipeline(BasePipeline):
    def extract(self) -> DataFrame:
        # Your extraction logic
        ...

    def transform(self, df: DataFrame) -> DataFrame:
        # Your transformation logic (optional override)
        ...
  3. The base class handles Delta table creation, metadata columns, deduplication, and write modes for you.

Adding a New Environment

In config/environment.py, add your workspace org ID to WORKSPACE_ENV_MAP and create a corresponding EnvironmentConfig in the _build_environments() function.

Switching Cloud Providers

The kit is designed for Azure by default but is adaptable:

  • AWS: Update storage paths from abfss:// to s3://, swap Key Vault references for AWS Secrets Manager, use setup_credentials.sql AWS IAM Role template
  • GCP: Update storage paths to gs://, swap for GCP Secret Manager, use the GCP Service Account credential template

Requirements

| Requirement | Version |
| --- | --- |
| Databricks Runtime | 13.x or later |
| Python | 3.10+ |
| Unity Catalog | Enabled on workspace |
| Delta Lake | Included with DBR 13.x+ |

Optional Dependencies

  • azure-identity — For direct Key Vault access (if not using Databricks-backed scopes)
  • requests — For API ingestion pipelines (pre-installed on DBR)
  • databricks-sdk — For deployment scripts

Frequently Asked Questions

Can I use this without Unity Catalog?
The ingestion templates and config modules work without UC. The medallion_bootstrap and unity_catalog_setup modules require UC to be enabled.

Does this work on AWS or GCP Databricks?
Yes. The core patterns (Delta Lake, Spark, structured streaming) are cloud-agnostic. Storage paths and credential setup need adjustment — see the Customization Guide above.

Can I use this with Databricks Community Edition?
Partially. The ingestion templates and config modules will work. Unity Catalog features and some streaming connectors require a paid workspace.

What's the difference between medallion_bootstrap and unity_catalog_setup?
medallion_bootstrap/ contains Python scripts that create catalogs, schemas, and permissions programmatically — ideal for automation. unity_catalog_setup/ contains SQL scripts and a governance playbook — ideal for running interactively or as part of a Terraform/Pulumi workflow.



This is 1 of 6 resources in the DataStack Pro toolkit. Get the complete Databricks Starter Kit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire DataStack Pro bundle (6 products) for $164 — save 30%.

Get the Complete Bundle →

