Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Databricks Notebook Framework: Project Scaffold — Quick Start Guide

Project Scaffold — Quick Start Guide

Databricks Notebook Framework by Datanest Digital


Overview

This guide walks you through setting up a new Databricks data pipeline project using the Notebook Framework. In 15 minutes you'll have a working project structure with Bronze, Silver, and Gold layers ready for development.


Step 1: Create Your Project Structure

# Create project root
mkdir my-data-project && cd my-data-project
git init

# Copy framework files
cp -r /path/to/databricks-notebook-framework/* .

# Create additional directories
mkdir -p notebooks/{bronze,silver,gold}
mkdir -p tests
mkdir -p config

Your project should look like this:

my-data-project/
├── .pre-commit-config.yaml    # Copy from cicd/pre-commit-config.yaml
├── databricks.yml             # Copy from cicd/databricks.yml
├── notebooks/
│   ├── bronze/
│   │   └── bronze_my_source.py
│   ├── silver/
│   │   └── silver_my_entity.py
│   └── gold/
│       └── gold_my_summary.py
├── utils/
│   ├── logging_utils.py
│   ├── config_manager.py
│   ├── quality_checks.py
│   └── secrets_manager.py
├── tests/
│   ├── conftest.py            # Copy from testing/conftest.py
│   └── test_my_pipeline.py
├── config/
│   └── sources.json           # Optional: source configuration
├── standards/
│   └── NOTEBOOK_STANDARDS.md
└── README.md

Step 2: Set Up Pre-commit Hooks

# Copy the config to your project root
cp cicd/pre-commit-config.yaml .pre-commit-config.yaml

# Install pre-commit and register the git hooks
pip install pre-commit
pre-commit install

# Run against all files to verify
pre-commit run --all-files

Step 3: Create Your First Bronze Notebook

  1. Copy the template:
cp notebooks/bronze_ingest_template.py notebooks/bronze/bronze_my_source.py
  2. Edit notebooks/bronze/bronze_my_source.py:

    • Update widget defaults for your source system
    • Replace the "Read Source Data" cell with your actual source read logic
    • Adjust the watermark column name if needed
  3. Example — reading from Azure Blob Storage:

# Replace the placeholder cell with the following
# ("source" is the widget value defined earlier in the template):
source_path = f"abfss://landing@mystorageaccount.dfs.core.windows.net/{source}/"

df_raw = (
    spark.read
    .format("parquet")
    .option("mergeSchema", "true")
    .load(source_path)
)

Step 4: Create Your First Silver Notebook

  1. Copy the template:
cp notebooks/silver_transform_template.py notebooks/silver/silver_my_entity.py
  2. Edit notebooks/silver/silver_my_entity.py:

    • Set your primary key columns in the widget default
    • Define type cast mappings in the TYPE_CAST_MAP dictionary
    • Add custom quality checks for your business rules
  3. Example type casting:

from pyspark.sql.types import BooleanType, DateType, DoubleType, IntegerType

TYPE_CAST_MAP = {
    "order_amount": DoubleType(),
    "quantity": IntegerType(),
    "order_date": DateType(),
    "is_active": BooleanType(),
}

Step 5: Create Your First Gold Notebook

  1. Copy the template:
cp notebooks/gold_aggregate_template.py notebooks/gold/gold_my_summary.py
  2. Edit notebooks/gold/gold_my_summary.py:
    • Add your Silver table reads
    • Implement your aggregation logic using the provided patterns
    • Assign the final DataFrame to df_gold

Step 6: Configure DABs Deployment

  1. Copy and edit the bundle config:
cp cicd/databricks.yml databricks.yml
  2. Update databricks.yml:

    • Replace workspace host URLs with your actual Databricks workspace
    • Update notebook paths to match your project structure
    • Adjust cluster configurations for your workload
    • Set your notification email
  3. Validate:

databricks bundle validate
  4. Deploy to dev:

databricks bundle deploy -t dev
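For orientation, the skeleton of a Databricks Asset Bundles config with a single dev target looks roughly like this (the bundle name and host URL are placeholders; adapt the full version shipped in cicd/ rather than starting from this fragment):

```yaml
bundle:
  name: my-data-project

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net  # your workspace URL
```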

Step 7: Write Tests

  1. Copy the conftest:
cp testing/conftest.py tests/conftest.py
  2. Create a test file tests/test_my_pipeline.py:
from testing.test_framework import NutterTestBase, DataFrameAssertions
from pyspark.sql import functions as F


class TestMyPipeline(NutterTestBase):
    def before_transform_adds_columns(self):
        self.input_df = self.spark.createDataFrame(
            [(1, "Alice", "100.50"), (2, "Bob", "200.75")],
            ["id", "name", "amount"],
        )

    def run_transform_adds_columns(self):
        self.result_df = (
            self.input_df
            .withColumn("amount_numeric", F.col("amount").cast("double"))
        )

    def assertion_transform_adds_columns(self):
        dfa = DataFrameAssertions()
        dfa.assert_schema_contains(self.result_df, ["amount_numeric"])
        dfa.assert_row_count(self.result_df, expected=2)


def test_my_pipeline(spark):
    suite = TestMyPipeline(spark)
    report = suite.execute_tests()
    report.print_summary()
    assert report.all_passed
  3. Run tests:
pip install pytest pyspark delta-spark
pytest tests/ -v

Step 8: Source Configuration (Optional)

Create a config/sources.json to manage multiple sources declaratively:

{
  "sources": [
    {
      "name": "erp_orders",
      "source_type": "parquet",
      "source_path": "abfss://landing@storage.dfs.core.windows.net/erp/orders/",
      "watermark_column": "modified_at",
      "primary_keys": ["order_id"],
      "bronze_schema": "bronze",
      "silver_schema": "silver",
      "type_casts": {
        "order_amount": "double",
        "quantity": "integer",
        "order_date": "date"
      },
      "quality_checks": {
        "not_null": ["order_id", "customer_id", "order_date"],
        "unique": ["order_id"]
      }
    }
  ]
}
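A notebook can then drive every source from this file with a small loop. A sketch assuming the JSON shape above (the config is inlined here so the snippet is self-contained; in a notebook you would read config/sources.json, and `ingest_source` is a hypothetical dispatch into your Bronze logic):

```python
import json

# Inline copy of the config shape shown above; a notebook would read
# the file instead, e.g. json.load(open("config/sources.json"))
config_text = """
{
  "sources": [
    {
      "name": "erp_orders",
      "source_type": "parquet",
      "watermark_column": "modified_at",
      "primary_keys": ["order_id"]
    }
  ]
}
"""

config = json.loads(config_text)

processed = []
for source in config["sources"]:
    # Real pipeline: ingest_source(source) — hypothetical helper that runs
    # the Bronze read/write using the per-source settings
    processed.append(source["name"])
```

Keeping per-source settings declarative means adding a new feed is a config change, not a new notebook.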

Common Commands

# Validate DABs bundle
databricks bundle validate

# Deploy to environment
databricks bundle deploy -t dev
databricks bundle deploy -t staging
databricks bundle deploy -t prod

# Run a specific job
databricks bundle run -t dev bronze_ingest_job

# Run tests
pytest tests/ -v

# Run pre-commit checks
pre-commit run --all-files

# Check notebook format
python -c "print(open('notebooks/bronze/bronze_my_source.py').readline())"
# Should output: # Databricks notebook source

Need Help?

  • Review standards/NOTEBOOK_STANDARDS.md for coding conventions
  • Check the utility module docstrings for usage examples
  • Visit datanest.dev for documentation and support

This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete Databricks Notebook Framework with all files, templates, and documentation for $59.

Get the Full Kit →

Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.

Get the Complete Bundle →

