# Project Scaffold — Quick Start Guide

*Databricks Notebook Framework by Datanest Digital*

## Overview

This guide walks you through setting up a new Databricks data pipeline project using the Notebook Framework. In about 15 minutes you'll have a working project structure with Bronze, Silver, and Gold layers ready for development.
## Step 1: Create Your Project Structure

```bash
# Create project root
mkdir my-data-project && cd my-data-project
git init

# Copy framework files
cp -r /path/to/databricks-notebook-framework/* .

# Create additional directories
mkdir -p notebooks/{bronze,silver,gold}
mkdir -p tests
mkdir -p config
```
Your project should look like this:

```text
my-data-project/
├── .pre-commit-config.yaml   # Copy from cicd/pre-commit-config.yaml
├── databricks.yml            # Copy from cicd/databricks.yml
├── notebooks/
│   ├── bronze/
│   │   └── bronze_my_source.py
│   ├── silver/
│   │   └── silver_my_entity.py
│   └── gold/
│       └── gold_my_summary.py
├── utils/
│   ├── logging_utils.py
│   ├── config_manager.py
│   ├── quality_checks.py
│   └── secrets_manager.py
├── tests/
│   ├── conftest.py           # Copy from testing/conftest.py
│   └── test_my_pipeline.py
├── config/
│   └── sources.json          # Optional: source configuration
├── standards/
│   └── NOTEBOOK_STANDARDS.md
└── README.md
```
## Step 2: Set Up Pre-commit Hooks

```bash
# Copy the config to your project root
cp cicd/pre-commit-config.yaml .pre-commit-config.yaml

# Install pre-commit and register the git hooks
pip install pre-commit
pre-commit install

# Run against all files to verify
pre-commit run --all-files
```
## Step 3: Create Your First Bronze Notebook

1. Copy the template:

   ```bash
   cp notebooks/bronze_ingest_template.py notebooks/bronze/bronze_my_source.py
   ```

2. Edit `notebooks/bronze/bronze_my_source.py`:
   - Update widget defaults for your source system
   - Replace the "Read Source Data" cell with your actual source read logic
   - Adjust the watermark column name if needed

Example — reading from Azure Blob Storage:

```python
# Replace the placeholder cell with:
source_path = f"abfss://landing@mystorageaccount.dfs.core.windows.net/{source}/"
df_raw = (
    spark.read
    .format("parquet")
    .option("mergeSchema", "true")
    .load(source_path)
)
```
## Step 4: Create Your First Silver Notebook

1. Copy the template:

   ```bash
   cp notebooks/silver_transform_template.py notebooks/silver/silver_my_entity.py
   ```

2. Edit `notebooks/silver/silver_my_entity.py`:
   - Set your primary key columns in the widget default
   - Define type cast mappings in the `TYPE_CAST_MAP` dictionary
   - Add custom quality checks for your business rules

Example type casting:

```python
from pyspark.sql.types import BooleanType, DateType, DoubleType, IntegerType

TYPE_CAST_MAP = {
    "order_amount": DoubleType(),
    "quantity": IntegerType(),
    "order_date": DateType(),
    "is_active": BooleanType(),
}
```
## Step 5: Create Your First Gold Notebook

1. Copy the template:

   ```bash
   cp notebooks/gold_aggregate_template.py notebooks/gold/gold_my_summary.py
   ```

2. Edit `notebooks/gold/gold_my_summary.py`:
   - Add your Silver table reads
   - Implement your aggregation logic using the provided patterns
   - Assign the final DataFrame to `df_gold`
## Step 6: Configure DABs Deployment

1. Copy and edit the bundle config:

   ```bash
   cp cicd/databricks.yml databricks.yml
   ```

2. Update `databricks.yml`:
   - Replace workspace host URLs with your actual Databricks workspace
   - Update notebook paths to match your project structure
   - Adjust cluster configurations for your workload
   - Set your notification email

3. Validate:

   ```bash
   databricks bundle validate
   ```

4. Deploy to dev:

   ```bash
   databricks bundle deploy -t dev
   ```
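For orientation, the target section you'll be editing might look roughly like this. The bundle name and host URLs are placeholders; the field names follow the Databricks Asset Bundles configuration schema:

```yaml
bundle:
  name: my-data-project

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-1111111111111111.1.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-2222222222222222.2.azuredatabricks.net
```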
## Step 7: Write Tests

1. Copy the conftest:

   ```bash
   cp testing/conftest.py tests/conftest.py
   ```

2. Create a test file `tests/test_my_pipeline.py`:

   ```python
   from testing.test_framework import NutterTestBase, DataFrameAssertions
   from pyspark.sql import functions as F


   class TestMyPipeline(NutterTestBase):
       def before_transform_adds_columns(self):
           self.input_df = self.spark.createDataFrame(
               [(1, "Alice", "100.50"), (2, "Bob", "200.75")],
               ["id", "name", "amount"],
           )

       def run_transform_adds_columns(self):
           self.result_df = self.input_df.withColumn(
               "amount_numeric", F.col("amount").cast("double")
           )

       def assertion_transform_adds_columns(self):
           dfa = DataFrameAssertions()
           dfa.assert_schema_contains(self.result_df, ["amount_numeric"])
           dfa.assert_row_count(self.result_df, expected=2)


   def test_my_pipeline(spark):
       suite = TestMyPipeline(spark)
       report = suite.execute_tests()
       report.print_summary()
       assert report.all_passed
   ```

3. Run tests:

   ```bash
   pip install pytest pyspark delta-spark
   pytest tests/ -v
   ```
## Step 8: Source Configuration (Optional)

Create a `config/sources.json` to manage multiple sources declaratively:

```json
{
  "sources": [
    {
      "name": "erp_orders",
      "source_type": "parquet",
      "source_path": "abfss://landing@storage.dfs.core.windows.net/erp/orders/",
      "watermark_column": "modified_at",
      "primary_keys": ["order_id"],
      "bronze_schema": "bronze",
      "silver_schema": "silver",
      "type_casts": {
        "order_amount": "double",
        "quantity": "integer",
        "order_date": "date"
      },
      "quality_checks": {
        "not_null": ["order_id", "customer_id", "order_date"],
        "unique": ["order_id"]
      }
    }
  ]
}
```
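The framework's `config_manager.py` may already cover this, but as a sketch, a driver notebook could read the file with a few lines of standard-library Python. The `load_sources` helper name is hypothetical:

```python
# Hypothetical helper for reading config/sources.json; adapt to however
# your driver notebook consumes source definitions.
import json


def load_sources(config_path: str = "config/sources.json") -> list:
    """Return the list of source definitions from the JSON config."""
    with open(config_path, encoding="utf-8") as f:
        return json.load(f)["sources"]


# Example usage (assuming the file above exists):
# for src in load_sources():
#     print(src["name"], src["watermark_column"], src["primary_keys"])
```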
## Common Commands

```bash
# Validate DABs bundle
databricks bundle validate

# Deploy to an environment
databricks bundle deploy -t dev
databricks bundle deploy -t staging
databricks bundle deploy -t prod

# Run a specific job
databricks bundle run -t dev bronze_ingest_job

# Run tests
pytest tests/ -v

# Run pre-commit checks
pre-commit run --all-files

# Check notebook format (the first line must be the Databricks marker)
head -n 1 notebooks/bronze/bronze_my_source.py
# Should output: # Databricks notebook source
```
## Need Help?

- Review `standards/NOTEBOOK_STANDARDS.md` for coding conventions
- Check the utility module docstrings for usage examples
- Visit datanest.dev for documentation and support
---

*This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete Databricks Notebook Framework with all files, templates, and documentation for $59. Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.*