# Databricks Starter Kit
Production-ready templates for building data platforms on Databricks with Unity Catalog and Delta Lake.
Skip the months of trial and error. This kit gives you the same patterns and architecture used by data platform teams at scale — fully documented, customizable, and ready to deploy.
## What's Inside

| Module | What It Does |
|---|---|
| `config/` | Environment detection, secret management, structured logging |
| `medallion_bootstrap/` | One-command setup of your bronze/silver/gold catalog structure with RBAC |
| `ingestion_templates/` | Battle-tested pipelines for APIs, databases, files, and streaming |
| `cicd_templates/` | Azure DevOps & GitHub Actions pipelines, deployment scripts, test runner |
| `unity_catalog_setup/` | SQL scripts for catalogs, external locations, credentials, and governance |
21 files — every one fully runnable, type-hinted, and documented.
## Quick Start

### 1. Upload to Databricks

Upload this kit to a Databricks Repo or the workspace file system:

```
/Repos/<your-user>/databricks-starter-kit/
```

Or use the Databricks CLI:

```shell
databricks repos create \
  --url https://github.com/your-org/databricks-starter-kit \
  --provider github
```
### 2. Configure Your Environment

Edit `config/environment.py` to register your workspace IDs:

```python
WORKSPACE_ENV_MAP: dict[str, str] = {
    "1234567890123456": "dev",      # Your dev workspace org ID
    "2345678901234567": "staging",  # Your staging workspace
    "3456789012345678": "prod",     # Your production workspace
}
```
Update storage account names and catalog prefixes to match your naming conventions.
### 3. Bootstrap the Medallion Architecture

Edit `medallion_bootstrap/config.py` with your catalog names and team groups, then run:

```python
# In a Databricks notebook — run these in order:
%run ./medallion_bootstrap/01_create_catalogs
%run ./medallion_bootstrap/02_create_schemas
%run ./medallion_bootstrap/03_grant_permissions
```
This creates your full bronze → silver → gold catalog structure with proper RBAC in minutes.
### 4. Run Your First Ingestion

```python
from ingestion_templates.api_ingestion import ApiIngestionPipeline, ApiSourceConfig

pipeline = ApiIngestionPipeline(
    pipeline_name="demo_api_load",
    source_system="demo_api",
    source_class="users",
    api_config=ApiSourceConfig(
        base_url="https://jsonplaceholder.typicode.com",
        endpoint="/users",
    ),
)
pipeline.run()
```
## File Structure

```
databricks-starter-kit/
│
├── README.md                          # This file
├── LICENSE                            # MIT License
│
├── config/
│   ├── environment.py                 # Environment detection & workspace config
│   ├── secrets.py                     # Secret management (Key Vault, scopes, env vars)
│   └── logging_config.py              # Structured logging with Delta table sink
│
├── medallion_bootstrap/
│   ├── config.py                      # Catalog, schema, and permission definitions
│   ├── 01_create_catalogs.py          # Create bronze/silver/gold Unity Catalogs
│   ├── 02_create_schemas.py           # Create schemas within each catalog
│   └── 03_grant_permissions.py        # Grant RBAC permissions to groups
│
├── ingestion_templates/
│   ├── base_pipeline.py               # Abstract base class — merge, append, SCD2, dedup
│   ├── api_ingestion.py               # REST API ingestion with pagination & retry
│   ├── database_ingestion.py          # JDBC ingestion (SQL Server, PostgreSQL, Oracle, MySQL)
│   ├── file_ingestion.py              # Auto Loader (cloudFiles) for CSV/JSON/Parquet/Avro/XML
│   └── streaming_ingestion.py         # Event Hub & Kafka streaming with checkpoints
│
├── cicd_templates/
│   ├── azure_devops_pipeline.yml      # Multi-stage Azure DevOps pipeline
│   ├── github_actions_workflow.yml    # GitHub Actions with OIDC & environment protection
│   ├── deploy_notebooks.py            # Deploy workflows & notebooks via REST API
│   └── run_tests.py                   # Test runner — tag validation, structure checks, secrets scan
│
└── unity_catalog_setup/
    ├── setup_catalogs.sql             # SQL to create catalogs with tags and properties
    ├── setup_external_locations.sql   # External locations for each data layer
    ├── setup_credentials.sql          # Storage credentials (Azure MI, SP, AWS IAM, GCP SA)
    └── data_governance_policies.md    # Governance playbook — RBAC, classification, DQ, audit
```
## Module Deep Dives

### config/ — Environment, Secrets & Logging

`environment.py` — Auto-detects which Databricks workspace you're running in and returns the correct environment config (dev/staging/prod). Uses the workspace org ID for detection, with a serverless-compatible fallback via workspace URL parsing.

```python
from config.environment import get_environment

env = get_environment()
print(env.name)            # "dev"
print(env.catalog_prefix)  # "dev"
print(env.storage.raw)     # "abfss://raw@styourorgdev.dfs.core.windows.net"
```
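The URL-parsing fallback mentioned above can be illustrated with a small sketch. The helper name and the hostname markers here are hypothetical assumptions for illustration, not the kit's actual code:

```python
from urllib.parse import urlparse

# Hypothetical illustration: map a workspace URL to an environment name
# when the org ID is unavailable (e.g. on serverless compute).
# The "-dev"/"-stg" markers assume a naming convention — adjust to yours.
URL_ENV_MARKERS = {"-dev": "dev", "-stg": "staging"}

def env_from_workspace_url(url: str, default: str = "prod") -> str:
    host = urlparse(url).hostname or ""
    for marker, env in URL_ENV_MARKERS.items():
        if marker in host:
            return env
    return default

print(env_from_workspace_url("https://adb-123-dev.azuredatabricks.net"))  # dev
```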
`secrets.py` — Unified secret access with layered lookup: Databricks Secret Scope → Azure Key Vault → environment variables. Includes JDBC connection string builder and in-memory caching with TTL.

```python
from config.secrets import SecretManager

secrets = SecretManager()
api_key = secrets.get("my-api-key")
jdbc_url = secrets.jdbc_url("my-database", driver="sqlserver")
```
`logging_config.py` — Structured logging with correlation IDs for tracing a pipeline run across stages. Optionally flushes logs to a Delta table for a persistent audit trail and dashboards.

```python
from config.logging_config import PipelineLogger

logger = PipelineLogger(pipeline_name="crm_sync", environment="prod")
logger.info("Starting extraction", extra={"table": "accounts"})
logger.log_metric("rows_processed", 152_000)
logger.flush_to_delta("audit.pipeline_logs")
```
### medallion_bootstrap/ — Catalog Setup in Minutes
Creates your entire Unity Catalog structure — catalogs, schemas, and permissions — from a single configuration file.
What it creates:
| Layer | Catalog | Schemas (customizable) |
|---|---|---|
| Bronze | `{env}_bronze` | `raw`, `cdc`, `streaming` |
| Silver | `{env}_silver` | `cleansed`, `conformed`, `enriched` |
| Gold | `{env}_gold` | `analytics`, `reporting`, `features` |
Permissions are preset-based — assign groups like `data_readers`, `data_engineers`, and `data_admins` with sensible defaults, or customize granularly.

Edit `medallion_bootstrap/config.py` once, then the three scripts handle the rest idempotently (safe to re-run).
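The idempotency comes from issuing only `IF NOT EXISTS`-style statements, so re-running never clobbers existing objects. A hedged sketch of how a config-driven bootstrap script might emit them — the config shape below is assumed for illustration; see `medallion_bootstrap/config.py` for the real structure:

```python
# Hypothetical config shape for illustration only.
CATALOGS = {
    "dev_bronze": ["raw", "cdc", "streaming"],
    "dev_silver": ["cleansed", "conformed", "enriched"],
    "dev_gold": ["analytics", "reporting", "features"],
}

def build_statements(catalogs: dict[str, list[str]]) -> list[str]:
    """Emit idempotent CREATE statements for catalogs and their schemas."""
    stmts = []
    for catalog, schemas in catalogs.items():
        stmts.append(f"CREATE CATALOG IF NOT EXISTS {catalog}")
        for schema in schemas:
            stmts.append(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
    return stmts

for stmt in build_statements(CATALOGS):
    print(stmt)  # in a notebook, execute each via spark.sql(stmt)
```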
### ingestion_templates/ — Five Pipelines, One Base Class

All pipelines extend `BasePipeline`, which provides:

- **Write modes:** `append`, `merge` (upsert), `overwrite`, `scd2` (Slowly Changing Dimension Type 2)
- **Metadata columns:** `_etl_load_timestamp_utc`, `_source_system`, `_source_filename` added automatically
- **Deduplication:** window-function-based dedup with configurable key and ordering columns
- **Column sanitizer:** `fix_column_names()` replaces spaces, dots, and special characters
- **Delta table init:** auto-creates target tables with configurable properties
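The dedup rule is the standard "latest row per key wins" pattern. In Spark it is a window function (`row_number()` partitioned by the key columns, ordered by the ordering column descending, keeping rank 1). The same rule in plain Python, purely for illustration of what the base class does:

```python
# Plain-Python illustration of "keep the latest row per key" dedup.
# The kit does this with a Spark window function; this sketch only
# demonstrates the rule itself.
def dedup(rows: list[dict], key: str, order_by: str) -> list[dict]:
    latest: dict = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"id": 1, "updated_at": "2024-01-01", "name": "old"},
    {"id": 1, "updated_at": "2024-02-01", "name": "new"},
    {"id": 2, "updated_at": "2024-01-15", "name": "only"},
]
print(dedup(rows, key="id", order_by="updated_at"))
# keeps the "new" row for id=1 and the "only" row for id=2
```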
#### API Ingestion (`api_ingestion.py`)

REST API ingestion with:

- Pagination: offset, cursor, and link-header strategies
- Retry with exponential backoff
- Content hashing for change detection
- Rate limiting
- Incremental and full-refresh modes
```python
pipeline = ApiIngestionPipeline(
    pipeline_name="salesforce_accounts",
    source_system="salesforce",
    source_class="accounts",
    api_config=ApiSourceConfig(
        base_url="https://your-instance.salesforce.com",
        endpoint="/services/data/v58.0/query",
        auth_type="bearer",
        pagination_type="cursor",
        cursor_field="nextRecordsUrl",
    ),
)
pipeline.run()
```
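The retry-with-exponential-backoff behaviour listed above is a standard pattern; a minimal generic sketch (not the kit's actual implementation, which also handles rate-limit headers):

```python
import time

def with_retry(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn(), retrying on exception with exponential backoff:
    sleeps base_delay, 2x, 4x, ... between attempts, re-raising on the last."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

A real pipeline would typically also respect `Retry-After` headers and add jitter to avoid thundering-herd retries.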
#### Database Ingestion (`database_ingestion.py`)
JDBC ingestion with:
- Auto driver detection for SQL Server, PostgreSQL, Oracle, MySQL
- Full and incremental load modes (watermark-based)
- Parallel partitioned reads for large tables
- String trimming and type coercion
```python
pipeline = DatabaseIngestionPipeline(
    pipeline_name="erp_customers",
    source_system="erp",
    source_class="customers",
    jdbc_config=JdbcSourceConfig(
        host="erp-sql-server.database.windows.net",
        database="erp_prod",
        schema="dbo",
        table="Customers",
        driver_type="sqlserver",
        load_mode="incremental",
        watermark_column="ModifiedDate",
    ),
)
pipeline.run()
```
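Watermark-based incremental loading works by pushing a predicate down to the source database so that only rows changed since the last successful load cross the wire. A hedged sketch of the idea — the kit's real query construction may differ, and production code should bind the watermark value safely rather than interpolate it:

```python
def incremental_query(schema: str, table: str,
                      watermark_column: str, last_watermark: str) -> str:
    """Build a JDBC pushdown query fetching only rows changed since the
    last load. Illustrative only — real code should parameterize the
    watermark instead of string interpolation."""
    return (
        f"SELECT * FROM {schema}.{table} "
        f"WHERE {watermark_column} > '{last_watermark}'"
    )

q = incremental_query("dbo", "Customers", "ModifiedDate", "2024-06-01 00:00:00")
print(q)
# Pass the query via the JDBC reader, e.g.
# spark.read.format("jdbc").option("query", q)...
```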
#### File Ingestion (`file_ingestion.py`)

Cloud Files (Auto Loader) ingestion with:

- Formats: CSV, JSON, Parquet, Avro, XML
- Schema inference with evolution support
- `foreachBatch` writer for merge/upsert targets
- Trigger-once (batch) and continuous streaming modes
```python
pipeline = FileIngestionPipeline(
    pipeline_name="landing_csvs",
    source_system="external_vendor",
    source_class="transactions",
    file_config=FileSourceConfig(
        source_path="abfss://landing@storage.dfs.core.windows.net/vendor/",
        file_format="csv",
        header=True,
        trigger_mode="trigger_once",
    ),
)
pipeline.run()
```
#### Streaming Ingestion (`streaming_ingestion.py`)
Event Hub and Kafka ingestion with:
- Event Hub auth: OAuth (Managed Identity) or connection string
- Kafka auth: SASL/SSL, plaintext, or OAuth
- Checkpoint management for exactly-once semantics
- JSON value parsing with schema
- Configurable triggers and watermarks
```python
from ingestion_templates.streaming_ingestion import (
    StreamingIngestionPipeline,
    EventHubConfig,
)

pipeline = StreamingIngestionPipeline(
    pipeline_name="clickstream_events",
    source_system="event_hub",
    source_class="clickstream",
    streaming_config=EventHubConfig(
        namespace="your-eventhub-namespace",
        topic="clickstream",
        consumer_group="databricks-consumer",
        auth_type="oauth",
    ),
)
pipeline.run()
```
### cicd_templates/ — Ship With Confidence
#### Azure DevOps (`azure_devops_pipeline.yml`)
Multi-stage pipeline with:
- Service principal authentication
- Git-diff-based workflow deployment (only deploys what changed)
- Test stage with ruff linting and pytest
- Manual approval gate for production
- Parameterized for dev/staging/prod
#### GitHub Actions (`github_actions_workflow.yml`)
Workflow with:
- OIDC authentication (no stored secrets)
- Environment protection rules
- PR validation (lint + test)
- Matrix deployment across environments
#### Deployment Script (`deploy_notebooks.py`)
CLI tool that:
- Deploys workflow definitions from JSON files via the Jobs API 2.1
- Updates Databricks Repos to a specific branch or tag
- Imports individual notebooks via the Workspace API
- Supports dry-run mode for previewing changes
```shell
# Deploy a workflow
python deploy_notebooks.py deploy-workflow \
  --workspace-url https://adb-123.azuredatabricks.net \
  --file workflows/my_pipeline.json

# Dry run
python deploy_notebooks.py deploy-workflow \
  --file workflows/my_pipeline.json \
  --dry-run
```
#### Test Runner (`run_tests.py`)
Validates your Databricks project:
- Workflow tag validation (required tags, valid values)
- JSON structure validation
- Hardcoded secret detection (scans for API keys, passwords, tokens)
- Python syntax checking
- JUnit XML output for CI integration
```shell
python run_tests.py --workflow-dir workflows/ --output results.xml
```
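Hardcoded-secret detection is usually a handful of regexes run over the source tree. A simplified sketch of the approach — the patterns here are illustrative assumptions, not the kit's exact rules:

```python
import re

# Illustrative patterns; a real scanner uses a broader, tuned set.
SECRET_PATTERNS = [
    # quoted literal assigned to a password/api_key/token-like name
    re.compile(r"""(?i)(password|api[_-]?key|token)\s*=\s*['"][^'"]+['"]"""),
    # AWS access key ID shape
    re.compile(r"AKIA[0-9A-Z]{16}"),
]

def find_secrets(text: str) -> list[str]:
    """Return every substring that matches a secret pattern."""
    hits = []
    for pattern in SECRET_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits

sample = 'api_key = "abc123"\nsafe = os.environ["API_KEY"]'
print(find_secrets(sample))  # flags only the hardcoded assignment
```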
### unity_catalog_setup/ — Governance From Day One
SQL scripts and a governance playbook for Unity Catalog:
- `setup_catalogs.sql` — Creates bronze/silver/gold/audit catalogs with tags and default schemas
- `setup_external_locations.sql` — External locations for each layer (raw, bronze, silver, gold, checkpoints, export)
- `setup_credentials.sql` — Storage credential templates for Azure Managed Identity, Service Principal, AWS IAM Role, and GCP Service Account
- `data_governance_policies.md` — Complete governance playbook covering:
  - RBAC model with group hierarchy
  - Data classification tiers (public, internal, confidential, restricted) with tagging SQL
  - Row-level and column-level security patterns
  - Data retention and VACUUM policies
  - Data quality constraints and quarantine patterns
  - Audit queries for tracking access
  - Compliance checklist
## Customization Guide

### Naming Conventions

Every naming pattern is configurable. Key places to update:
| What | Where | Default Pattern |
|---|---|---|
| Catalog names | `medallion_bootstrap/config.py` | `{env}_bronze`, `{env}_silver`, `{env}_gold` |
| Schema names | `medallion_bootstrap/config.py` | `raw`, `cleansed`, `analytics`, etc. |
| Storage accounts | `config/environment.py` | `styourorg{env}` |
| Secret scope | `config/secrets.py` | `pipeline_secrets` |
| Log table | `config/logging_config.py` | `audit.pipeline_logs` |
### Adding a New Ingestion Pipeline

1. Create a new file in `ingestion_templates/`
2. Extend `BasePipeline`:

```python
from pyspark.sql import DataFrame

from ingestion_templates.base_pipeline import BasePipeline

class MyCustomPipeline(BasePipeline):
    def extract(self) -> DataFrame:
        # Your extraction logic
        ...

    def transform(self, df: DataFrame) -> DataFrame:
        # Your transformation logic (optional override)
        ...
```

3. The base class handles Delta table creation, metadata columns, deduplication, and write modes for you.
### Adding a New Environment

In `config/environment.py`, add your workspace org ID to `WORKSPACE_ENV_MAP` and create a corresponding `EnvironmentConfig` in the `_build_environments()` function.
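A sketch of what registering a new `qa` environment might look like. The field names on `EnvironmentConfig` below are hypothetical assumptions for illustration; match them to the real class in `config/environment.py`:

```python
from dataclasses import dataclass

# Hypothetical sketch — field names are illustrative, not the kit's
# actual EnvironmentConfig definition.
@dataclass(frozen=True)
class EnvironmentConfig:
    name: str
    catalog_prefix: str
    storage_account: str

WORKSPACE_ENV_MAP = {"4567890123456789": "qa"}  # new workspace org ID

ENVIRONMENTS = {
    "qa": EnvironmentConfig(name="qa", catalog_prefix="qa",
                            storage_account="styourorgqa"),
}

env = ENVIRONMENTS[WORKSPACE_ENV_MAP["4567890123456789"]]
print(env.catalog_prefix)  # "qa"
```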
### Switching Cloud Providers

The kit is designed for Azure by default but is adaptable:

- **AWS:** Update storage paths from `abfss://` to `s3://`, swap Key Vault references for AWS Secrets Manager, and use the `setup_credentials.sql` AWS IAM Role template
- **GCP:** Update storage paths to `gs://`, swap in GCP Secret Manager, and use the GCP Service Account credential template
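The path translation is mostly mechanical. A small hypothetical helper shows the idea for `abfss://` to `s3://`; the bucket-naming choice (reuse the container name) is an assumption to adjust to your own convention:

```python
import re

def abfss_to_s3(path: str) -> str:
    """Rewrite abfss://<container>@<account>.dfs.core.windows.net/<key>
    to s3://<container>/<key>. Reusing the container name as the bucket
    is an assumed convention — adapt as needed."""
    m = re.match(r"abfss://([^@]+)@[^/]+/(.*)", path)
    if not m:
        return path  # leave non-abfss paths untouched
    container, key = m.groups()
    return f"s3://{container}/{key}"

print(abfss_to_s3("abfss://raw@styourorgdev.dfs.core.windows.net/vendor/"))
# s3://raw/vendor/
```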
## Requirements
| Requirement | Version |
|---|---|
| Databricks Runtime | 13.x or later |
| Python | 3.10+ |
| Unity Catalog | Enabled on workspace |
| Delta Lake | Included with DBR 13.x+ |
### Optional Dependencies

- `azure-identity` — for direct Key Vault access (if not using Databricks-backed scopes)
- `requests` — for API ingestion pipelines (pre-installed on DBR)
- `databricks-sdk` — for deployment scripts
## Frequently Asked Questions
### Can I use this without Unity Catalog?
The ingestion templates and config modules work without UC. The medallion bootstrap and unity catalog setup modules require UC to be enabled.
### Does this work on AWS or GCP Databricks?
Yes. The core patterns (Delta Lake, Spark, structured streaming) are cloud-agnostic. Storage paths and credential setup need adjustment — see the Customization Guide above.
### Can I use this with Databricks Community Edition?
Partially. The ingestion templates and config modules will work. Unity Catalog features and some streaming connectors require a paid workspace.
### What's the difference between medallion_bootstrap and unity_catalog_setup?

`medallion_bootstrap/` contains Python scripts that create catalogs, schemas, and permissions programmatically — ideal for automation. `unity_catalog_setup/` contains SQL scripts and a governance playbook — ideal for running interactively or as part of a Terraform/Pulumi workflow.
This is 1 of 6 resources in the DataStack Pro toolkit. Get the complete Databricks Starter Kit with all files, templates, and documentation for $39.
Or grab the entire DataStack Pro bundle (6 products) for $164 — save 30%.