
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Databricks Starter Kit

Production-ready templates for building data platforms on Databricks with Unity Catalog and Delta Lake.

Skip the months of trial and error. This kit gives you the same patterns and architecture used by data platform teams at scale — fully documented, customizable, and ready to deploy.


What's Inside

| Module | What It Does |
| --- | --- |
| config/ | Environment detection, secret management, structured logging |
| medallion_bootstrap/ | One-command setup of your bronze/silver/gold catalog structure with RBAC |
| ingestion_templates/ | Battle-tested pipelines for APIs, databases, files, and streaming |
| cicd_templates/ | Azure DevOps & GitHub Actions pipelines, deployment scripts, test runner |
| unity_catalog_setup/ | SQL scripts for catalogs, external locations, credentials, and governance |

21 files — every one fully runnable, type-hinted, and documented.


Quick Start

1. Upload to Databricks

Upload this kit to a Databricks Repo or the workspace file system:

/Repos/<your-user>/databricks-starter-kit/

Or use the Databricks CLI:

databricks repos create \
  --url https://github.com/your-org/databricks-starter-kit \
  --provider github

2. Configure Your Environment

Edit config/environment.py to register your workspace IDs:

WORKSPACE_ENV_MAP: dict[str, str] = {
    "1234567890123456": "dev",      # Your dev workspace org ID
    "2345678901234567": "staging",  # Your staging workspace
    "3456789012345678": "prod",     # Your production workspace
}

Update storage account names and catalog prefixes to match your naming conventions.

3. Bootstrap the Medallion Architecture

Edit medallion_bootstrap/config.py with your catalog names and team groups, then run:

# In a Databricks notebook — run these in order:
%run ./medallion_bootstrap/01_create_catalogs
%run ./medallion_bootstrap/02_create_schemas
%run ./medallion_bootstrap/03_grant_permissions

This creates your full bronze → silver → gold catalog structure with proper RBAC in minutes.

4. Run Your First Ingestion

from ingestion_templates.api_ingestion import ApiIngestionPipeline, ApiSourceConfig

pipeline = ApiIngestionPipeline(
    pipeline_name="demo_api_load",
    source_system="demo_api",
    source_class="users",
    api_config=ApiSourceConfig(
        base_url="https://jsonplaceholder.typicode.com",
        endpoint="/users",
    ),
)
pipeline.run()

File Structure

databricks-starter-kit/
│
├── README.md                          # This file
├── LICENSE                            # MIT License
│
├── config/
│   ├── environment.py                 # Environment detection & workspace config
│   ├── secrets.py                     # Secret management (Key Vault, scopes, env vars)
│   └── logging_config.py              # Structured logging with Delta table sink
│
├── medallion_bootstrap/
│   ├── config.py                      # Catalog, schema, and permission definitions
│   ├── 01_create_catalogs.py          # Create bronze/silver/gold Unity Catalogs
│   ├── 02_create_schemas.py           # Create schemas within each catalog
│   └── 03_grant_permissions.py        # Grant RBAC permissions to groups
│
├── ingestion_templates/
│   ├── base_pipeline.py               # Abstract base class — merge, append, SCD2, dedup
│   ├── api_ingestion.py               # REST API ingestion with pagination & retry
│   ├── database_ingestion.py          # JDBC ingestion (SQL Server, PostgreSQL, Oracle, MySQL)
│   ├── file_ingestion.py              # Auto Loader (cloudFiles) for CSV/JSON/Parquet/Avro/XML
│   └── streaming_ingestion.py         # Event Hub & Kafka streaming with checkpoints
│
├── cicd_templates/
│   ├── azure_devops_pipeline.yml      # Multi-stage Azure DevOps pipeline
│   ├── github_actions_workflow.yml    # GitHub Actions with OIDC & environment protection
│   ├── deploy_notebooks.py            # Deploy workflows & notebooks via REST API
│   └── run_tests.py                   # Test runner — tag validation, structure checks, secrets scan
│
└── unity_catalog_setup/
    ├── setup_catalogs.sql             # SQL to create catalogs with tags and properties
    ├── setup_external_locations.sql   # External locations for each data layer
    ├── setup_credentials.sql          # Storage credentials (Azure MI, SP, AWS IAM, GCP SA)
    └── data_governance_policies.md    # Governance playbook — RBAC, classification, DQ, audit

Module Deep Dives

config/ — Environment, Secrets & Logging

environment.py — Auto-detects which Databricks workspace you're running in and returns the correct environment config (dev/staging/prod). Uses the workspace org ID for detection, with a serverless-compatible fallback via workspace URL parsing.

from config.environment import get_environment

env = get_environment()
print(env.name)             # "dev"
print(env.catalog_prefix)   # "dev"
print(env.storage.raw)      # "abfss://raw@styourorgdev.dfs.core.windows.net"
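The URL-parsing fallback boils down to pulling the numeric org ID out of the workspace hostname. Here is a minimal sketch of that idea; the helper name and the lookup map are illustrative, not the kit's actual internals:

```python
import re

# Illustrative map, same shape as WORKSPACE_ENV_MAP above.
WORKSPACE_ENV_MAP: dict[str, str] = {
    "1234567890123456": "dev",
    "2345678901234567": "staging",
}

def env_from_workspace_url(url: str) -> str:
    """Fallback detection: Azure Databricks URLs look like
    https://adb-<org-id>.<n>.azuredatabricks.net, so the org ID
    is the run of digits after 'adb-'."""
    match = re.search(r"adb-(\d+)\.", url)
    if not match:
        raise ValueError(f"Could not parse an org ID from {url!r}")
    return WORKSPACE_ENV_MAP.get(match.group(1), "dev")
```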

secrets.py — Unified secret access with layered lookup: Databricks Secret Scope → Azure Key Vault → environment variables. Includes JDBC connection string builder and in-memory caching with TTL.

from config.secrets import SecretManager

secrets = SecretManager()
api_key = secrets.get("my-api-key")
jdbc_url = secrets.jdbc_url("my-database", driver="sqlserver")

logging_config.py — Structured logging with correlation IDs for tracing a pipeline run across stages. Optionally flushes logs to a Delta table for persistent audit trail and dashboards.

from config.logging_config import PipelineLogger

logger = PipelineLogger(pipeline_name="crm_sync", environment="prod")
logger.info("Starting extraction", extra={"table": "accounts"})
logger.log_metric("rows_processed", 152_000)
logger.flush_to_delta("audit.pipeline_logs")

medallion_bootstrap/ — Catalog Setup in Minutes

Creates your entire Unity Catalog structure — catalogs, schemas, and permissions — from a single configuration file.

What it creates:

| Layer | Catalog | Schemas (customizable) |
| --- | --- | --- |
| Bronze | {env}_bronze | raw, cdc, streaming |
| Silver | {env}_silver | cleansed, conformed, enriched |
| Gold | {env}_gold | analytics, reporting, features |

Permissions are preset-based — assign groups like data_readers, data_engineers, and data_admins with sensible defaults, or customize them at a granular level.

Edit medallion_bootstrap/config.py once, then the three scripts handle the rest idempotently (safe to re-run).
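Conceptually, the three scripts expand a small configuration into idempotent SQL. The sketch below illustrates the shape of that expansion; the kit's real config.py and scripts are richer (grants, tags, managed locations), so treat the names here as illustrative:

```python
# Layer -> schema names, mirroring the table above.
LAYERS: dict[str, list[str]] = {
    "bronze": ["raw", "cdc", "streaming"],
    "silver": ["cleansed", "conformed", "enriched"],
    "gold": ["analytics", "reporting", "features"],
}

def bootstrap_statements(env: str) -> list[str]:
    """Generate idempotent CREATE statements for one environment.
    IF NOT EXISTS is what makes re-running the scripts safe."""
    stmts: list[str] = []
    for layer, schemas in LAYERS.items():
        catalog = f"{env}_{layer}"
        stmts.append(f"CREATE CATALOG IF NOT EXISTS {catalog}")
        for schema in schemas:
            stmts.append(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
    return stmts

# In a notebook, each statement would be executed with spark.sql(stmt).
```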


ingestion_templates/ — Five Pipelines, One Base Class

All pipelines extend BasePipeline, which provides:

  • Write modes: append, merge (upsert), overwrite, scd2 (Slowly Changing Dimension Type 2)
  • Metadata columns: _etl_load_timestamp_utc, _source_system, _source_filename added automatically
  • Deduplication: Window-function-based dedup with configurable key and ordering columns
  • Column sanitizer: fix_column_names() replaces spaces, dots, and special characters
  • Delta table init: Auto-creates target tables with configurable properties
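The column sanitizer, for example, amounts to a regex pass over the column names. A minimal sketch, assuming the kit's fix_column_names() behaves roughly like this (details such as casing may differ):

```python
import re

def fix_column_names(columns: list[str]) -> list[str]:
    """Replace spaces, dots, and special characters so names are
    valid Delta column identifiers (sketch; details may differ)."""
    fixed = []
    for name in columns:
        clean = re.sub(r"[^0-9a-zA-Z_]+", "_", name)  # specials -> underscore
        clean = re.sub(r"_+", "_", clean).strip("_")  # collapse and trim
        fixed.append(clean.lower())
    return fixed
```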

API Ingestion (api_ingestion.py)

REST API ingestion with:

  • Pagination: offset, cursor, and link-header strategies
  • Retry with exponential backoff
  • Content hashing for change detection
  • Rate limiting
  • Incremental and full-refresh modes
pipeline = ApiIngestionPipeline(
    pipeline_name="salesforce_accounts",
    source_system="salesforce",
    source_class="accounts",
    api_config=ApiSourceConfig(
        base_url="https://your-instance.salesforce.com",
        endpoint="/services/data/v58.0/query",
        auth_type="bearer",
        pagination_type="cursor",
        cursor_field="nextRecordsUrl",
    ),
)
pipeline.run()
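Offset pagination with retry is the simplest of the strategies listed above. A self-contained sketch with an injected fetch function (the function and parameter names are illustrative, not the kit's internals):

```python
import time
from typing import Callable

def paginate_offset(
    fetch: Callable[[int, int], list[dict]],
    page_size: int = 100,
    max_retries: int = 3,
) -> list[dict]:
    """Pull pages via fetch(offset, limit) until a short or empty page,
    retrying each page with exponential backoff."""
    records: list[dict] = []
    offset = 0
    while True:
        for attempt in range(max_retries):
            try:
                page = fetch(offset, page_size)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s...
        records.extend(page)
        if len(page) < page_size:  # short page means we're done
            return records
        offset += page_size
```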

Database Ingestion (database_ingestion.py)

JDBC ingestion with:

  • Auto driver detection for SQL Server, PostgreSQL, Oracle, MySQL
  • Full and incremental load modes (watermark-based)
  • Parallel partitioned reads for large tables
  • String trimming and type coercion
pipeline = DatabaseIngestionPipeline(
    pipeline_name="erp_customers",
    source_system="erp",
    source_class="customers",
    jdbc_config=JdbcSourceConfig(
        host="erp-sql-server.database.windows.net",
        database="erp_prod",
        schema="dbo",
        table="Customers",
        driver_type="sqlserver",
        load_mode="incremental",
        watermark_column="ModifiedDate",
    ),
)
pipeline.run()
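Incremental mode rests on a watermark predicate pushed down to the source. A sketch of the idea (the kit's actual query construction may differ, and a production version should parameterize the value rather than interpolate it):

```python
def incremental_query(schema: str, table: str,
                      watermark_column: str, last_watermark: str) -> str:
    """Build a JDBC pushdown subquery that reads only rows changed
    since the last successful load."""
    return (
        f"(SELECT * FROM {schema}.{table} "
        f"WHERE {watermark_column} > '{last_watermark}') AS incremental_src"
    )

# Spark reads it via:
#   spark.read.format("jdbc").option("dbtable", incremental_query(...))
# After a successful load, MAX(watermark_column) of the batch
# becomes the stored watermark for the next run.
```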

File Ingestion (file_ingestion.py)

Cloud Files (Auto Loader) ingestion with:

  • Formats: CSV, JSON, Parquet, Avro, XML
  • Schema inference with evolution support
  • foreachBatch writer for merge/upsert targets
  • Trigger-once (batch) and continuous streaming modes
pipeline = FileIngestionPipeline(
    pipeline_name="landing_csvs",
    source_system="external_vendor",
    source_class="transactions",
    file_config=FileSourceConfig(
        source_path="abfss://landing@storage.dfs.core.windows.net/vendor/",
        file_format="csv",
        header=True,
        trigger_mode="trigger_once",
    ),
)
pipeline.run()

Streaming Ingestion (streaming_ingestion.py)

Event Hub and Kafka ingestion with:

  • Event Hub auth: OAuth (Managed Identity) or connection string
  • Kafka auth: SASL/SSL, plaintext, or OAuth
  • Checkpoint management for exactly-once semantics
  • JSON value parsing with schema
  • Configurable triggers and watermarks
from ingestion_templates.streaming_ingestion import (
    StreamingIngestionPipeline,
    EventHubConfig,
)

pipeline = StreamingIngestionPipeline(
    pipeline_name="clickstream_events",
    source_system="event_hub",
    source_class="clickstream",
    streaming_config=EventHubConfig(
        namespace="your-eventhub-namespace",
        topic="clickstream",
        consumer_group="databricks-consumer",
        auth_type="oauth",
    ),
)
pipeline.run()

cicd_templates/ — Ship With Confidence

Azure DevOps (azure_devops_pipeline.yml)

Multi-stage pipeline with:

  • Service principal authentication
  • Git-diff-based workflow deployment (only deploys what changed)
  • Test stage with ruff linting and pytest
  • Manual approval gate for production
  • Parameterized for dev/staging/prod

GitHub Actions (github_actions_workflow.yml)

Workflow with:

  • OIDC authentication (no stored secrets)
  • Environment protection rules
  • PR validation (lint + test)
  • Matrix deployment across environments

Deployment Script (deploy_notebooks.py)

CLI tool that:

  • Deploys workflow definitions from JSON files via the Jobs API 2.1
  • Updates Databricks Repos to a specific branch or tag
  • Imports individual notebooks via the Workspace API
  • Supports dry-run mode for previewing changes
# Deploy a workflow
python deploy_notebooks.py deploy-workflow \
  --workspace-url https://adb-123.azuredatabricks.net \
  --file workflows/my_pipeline.json

# Dry run
python deploy_notebooks.py deploy-workflow \
  --file workflows/my_pipeline.json \
  --dry-run

Test Runner (run_tests.py)

Validates your Databricks project:

  • Workflow tag validation (required tags, valid values)
  • JSON structure validation
  • Hardcoded secret detection (scans for API keys, passwords, tokens)
  • Python syntax checking
  • JUnit XML output for CI integration
python run_tests.py --workflow-dir workflows/ --output results.xml
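The secret scan is essentially a set of regexes run over source files. A minimal sketch of the idea (the patterns shown here are an assumption; the kit's actual list is longer and tuned):

```python
import re

# Illustrative patterns only; real scanners ship a much longer list.
SECRET_PATTERNS = [
    re.compile(r"""(?i)(password|passwd|api[_-]?key|token)\s*=\s*["'][^"']+["']"""),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID shape
]

def find_hardcoded_secrets(text: str) -> list[str]:
    """Return the offending snippets found in a source string."""
    hits: list[str] = []
    for pattern in SECRET_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits
```

Note that the first pattern only flags quoted literals, so code that pulls the value from a secret manager does not trip it.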

unity_catalog_setup/ — Governance From Day One

SQL scripts and a governance playbook for Unity Catalog:

  • setup_catalogs.sql — Creates bronze/silver/gold/audit catalogs with tags and default schemas
  • setup_external_locations.sql — External locations for each layer (raw, bronze, silver, gold, checkpoints, export)
  • setup_credentials.sql — Storage credential templates for Azure Managed Identity, Service Principal, AWS IAM Role, and GCP Service Account
  • data_governance_policies.md — Complete governance playbook covering:
    • RBAC model with group hierarchy
    • Data classification tiers (public, internal, confidential, restricted) with tagging SQL
    • Row-level and column-level security patterns
    • Data retention and VACUUM policies
    • Data quality constraints and quarantine patterns
    • Audit queries for tracking access
    • Compliance checklist

Customization Guide

Naming Conventions

Every naming pattern is configurable. Key places to update:

| What | Where | Default Pattern |
| --- | --- | --- |
| Catalog names | medallion_bootstrap/config.py | {env}_bronze, {env}_silver, {env}_gold |
| Schema names | medallion_bootstrap/config.py | raw, cleansed, analytics, etc. |
| Storage accounts | config/environment.py | styourorg{env} |
| Secret scope | config/secrets.py | pipeline_secrets |
| Log table | config/logging_config.py | audit.pipeline_logs |

Adding a New Ingestion Pipeline

  1. Create a new file in ingestion_templates/
  2. Extend BasePipeline:
from ingestion_templates.base_pipeline import BasePipeline

class MyCustomPipeline(BasePipeline):
    def extract(self) -> DataFrame:
        # Your extraction logic
        ...

    def transform(self, df: DataFrame) -> DataFrame:
        # Your transformation logic (optional override)
        ...
  3. The base class handles Delta table creation, metadata columns, deduplication, and write modes for you.

Adding a New Environment

In config/environment.py, add your workspace org ID to WORKSPACE_ENV_MAP and create a corresponding EnvironmentConfig in the _build_environments() function.

Switching Cloud Providers

The kit is designed for Azure by default but is adaptable:

  • AWS: Update storage paths from abfss:// to s3://, swap Key Vault references for AWS Secrets Manager, use setup_credentials.sql AWS IAM Role template
  • GCP: Update storage paths to gs://, swap for GCP Secret Manager, use the GCP Service Account credential template

Requirements

| Requirement | Version |
| --- | --- |
| Databricks Runtime | 13.x or later |
| Python | 3.10+ |
| Unity Catalog | Enabled on workspace |
| Delta Lake | Included with DBR 13.x+ |

Optional Dependencies

  • azure-identity — For direct Key Vault access (if not using Databricks-backed scopes)
  • requests — For API ingestion pipelines (pre-installed on DBR)
  • databricks-sdk — For deployment scripts

Frequently Asked Questions

Can I use this without Unity Catalog?
The ingestion templates and config modules work without UC. The medallion_bootstrap and unity_catalog_setup modules require UC to be enabled.

Does this work on AWS or GCP Databricks?
Yes. The core patterns (Delta Lake, Spark, structured streaming) are cloud-agnostic. Storage paths and credential setup need adjustment — see the Customization Guide above.

Can I use this with Databricks Community Edition?
Partially. The ingestion templates and config modules will work. Unity Catalog features and some streaming connectors require a paid workspace.

What's the difference between medallion_bootstrap and unity_catalog_setup?
medallion_bootstrap/ contains Python scripts that create catalogs, schemas, and permissions programmatically — ideal for automation. unity_catalog_setup/ contains SQL scripts and a governance playbook — ideal for running interactively or as part of a Terraform/Pulumi workflow.



This is 1 of 6 resources in the DataStack Pro toolkit. Get the complete Databricks Starter Kit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire DataStack Pro bundle (6 products) for $164 — save 30%.

Get the Complete Bundle →

