# Project Scaffold — Quick Start Guide

*Databricks Notebook Framework by Datanest Digital*

## Overview

This guide walks you through setting up a new Databricks data pipeline project using the Notebook Framework. In about 15 minutes you'll have a working project structure with Bronze, Silver, and Gold layers ready for development.
## Step 1: Create Your Project Structure

```bash
# Create project root
mkdir my-data-project && cd my-data-project
git init

# Copy framework files
cp -r /path/to/databricks-notebook-framework/* .

# Create additional directories
mkdir -p notebooks/{bronze,silver,gold}
mkdir -p tests
mkdir -p config
```
Your project should look like this:

```text
my-data-project/
├── .pre-commit-config.yaml   # Copy from cicd/pre-commit-config.yaml
├── databricks.yml            # Copy from cicd/databricks.yml
├── notebooks/
│   ├── bronze/
│   │   └── bronze_my_source.py
│   ├── silver/
│   │   └── silver_my_entity.py
│   └── gold/
│       └── gold_my_summary.py
├── utils/
│   ├── logging_utils.py
│   ├── config_manager.py
│   ├── quality_checks.py
│   └── secrets_manager.py
├── tests/
│   ├── conftest.py           # Copy from testing/conftest.py
│   └── test_my_pipeline.py
├── config/
│   └── sources.json          # Optional: source configuration
├── standards/
│   └── NOTEBOOK_STANDARDS.md
└── README.md
```
## Step 2: Set Up Pre-commit Hooks

```bash
# Copy the config to your project root
cp cicd/pre-commit-config.yaml .pre-commit-config.yaml

# Install pre-commit and register the git hooks
pip install pre-commit
pre-commit install

# Run against all files to verify
pre-commit run --all-files
```
## Step 3: Create Your First Bronze Notebook

1. Copy the template:

   ```bash
   cp notebooks/bronze_ingest_template.py notebooks/bronze/bronze_my_source.py
   ```

2. Edit `notebooks/bronze/bronze_my_source.py`:
   - Update widget defaults for your source system
   - Replace the "Read Source Data" cell with your actual source read logic
   - Adjust the watermark column name if needed

Example — reading from Azure Blob Storage:

```python
# Replace the placeholder cell with:
source_path = f"abfss://landing@mystorageaccount.dfs.core.windows.net/{source}/"
df_raw = (
    spark.read
    .format("parquet")
    .option("mergeSchema", "true")
    .load(source_path)
)
```
## Step 4: Create Your First Silver Notebook

1. Copy the template:

   ```bash
   cp notebooks/silver_transform_template.py notebooks/silver/silver_my_entity.py
   ```

2. Edit `notebooks/silver/silver_my_entity.py`:
   - Set your primary key columns in the widget default
   - Define type cast mappings in the `TYPE_CAST_MAP` dictionary
   - Add custom quality checks for your business rules

Example type casting:

```python
from pyspark.sql.types import BooleanType, DateType, DoubleType, IntegerType

TYPE_CAST_MAP = {
    "order_amount": DoubleType(),
    "quantity": IntegerType(),
    "order_date": DateType(),
    "is_active": BooleanType(),
}
```
## Step 5: Create Your First Gold Notebook

1. Copy the template:

   ```bash
   cp notebooks/gold_aggregate_template.py notebooks/gold/gold_my_summary.py
   ```

2. Edit `notebooks/gold/gold_my_summary.py`:
   - Add your Silver table reads
   - Implement your aggregation logic using the provided patterns
   - Assign the final DataFrame to `df_gold`
## Step 6: Configure DABs Deployment

1. Copy and edit the bundle config:

   ```bash
   cp cicd/databricks.yml databricks.yml
   ```

2. Update `databricks.yml`:
   - Replace workspace host URLs with your actual Databricks workspace
   - Update notebook paths to match your project structure
   - Adjust cluster configurations for your workload
   - Set your notification email

3. Validate:

   ```bash
   databricks bundle validate
   ```

4. Deploy to dev:

   ```bash
   databricks bundle deploy -t dev
   ```
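For orientation, the target section you'll be editing might look roughly like this. The bundle name and host URLs are placeholders; the field names follow the Databricks Asset Bundles configuration schema:

```yaml
bundle:
  name: my-data-project

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-1111111111111111.1.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-2222222222222222.2.azuredatabricks.net
```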
## Step 7: Write Tests

1. Copy the conftest:

   ```bash
   cp testing/conftest.py tests/conftest.py
   ```

2. Create a test file `tests/test_my_pipeline.py`:

   ```python
   from testing.test_framework import NutterTestBase, DataFrameAssertions
   from pyspark.sql import functions as F


   class TestMyPipeline(NutterTestBase):
       def before_transform_adds_columns(self):
           self.input_df = self.spark.createDataFrame(
               [(1, "Alice", "100.50"), (2, "Bob", "200.75")],
               ["id", "name", "amount"],
           )

       def run_transform_adds_columns(self):
           self.result_df = self.input_df.withColumn(
               "amount_numeric", F.col("amount").cast("double")
           )

       def assertion_transform_adds_columns(self):
           dfa = DataFrameAssertions()
           dfa.assert_schema_contains(self.result_df, ["amount_numeric"])
           dfa.assert_row_count(self.result_df, expected=2)


   def test_my_pipeline(spark):
       suite = TestMyPipeline(spark)
       report = suite.execute_tests()
       report.print_summary()
       assert report.all_passed
   ```

3. Run tests:

   ```bash
   pip install pytest pyspark delta-spark
   pytest tests/ -v
   ```
## Step 8: Source Configuration (Optional)

Create a `config/sources.json` to manage multiple sources declaratively:

```json
{
  "sources": [
    {
      "name": "erp_orders",
      "source_type": "parquet",
      "source_path": "abfss://landing@storage.dfs.core.windows.net/erp/orders/",
      "watermark_column": "modified_at",
      "primary_keys": ["order_id"],
      "bronze_schema": "bronze",
      "silver_schema": "silver",
      "type_casts": {
        "order_amount": "double",
        "quantity": "integer",
        "order_date": "date"
      },
      "quality_checks": {
        "not_null": ["order_id", "customer_id", "order_date"],
        "unique": ["order_id"]
      }
    }
  ]
}
```
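The framework's `config_manager.py` may already cover this, but as a sketch, a driver notebook could read the file with a few lines of standard-library Python. The `load_sources` helper name is hypothetical:

```python
# Hypothetical helper for reading config/sources.json; adapt to however
# your driver notebook consumes source definitions.
import json


def load_sources(config_path: str = "config/sources.json") -> list:
    """Return the list of source definitions from the JSON config."""
    with open(config_path, encoding="utf-8") as f:
        return json.load(f)["sources"]


# Example usage (assuming the file above exists):
# for src in load_sources():
#     print(src["name"], src["watermark_column"], src["primary_keys"])
```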
## Common Commands

```bash
# Validate DABs bundle
databricks bundle validate

# Deploy to an environment
databricks bundle deploy -t dev
databricks bundle deploy -t staging
databricks bundle deploy -t prod

# Run a specific job
databricks bundle run -t dev bronze_ingest_job

# Run tests
pytest tests/ -v

# Run pre-commit checks
pre-commit run --all-files

# Check notebook format (the first line must be the Databricks marker)
head -n 1 notebooks/bronze/bronze_my_source.py
# Should output: # Databricks notebook source
```
## Need Help?

- Review `standards/NOTEBOOK_STANDARDS.md` for coding conventions
- Check the utility module docstrings for usage examples
- Visit datanest.dev for documentation and support
---

*This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete Databricks Notebook Framework with all files, templates, and documentation for $59. Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.*