From a Unified Bronze Layer to Multiple Silver Layers: Streamlining Data Transformation in Databricks Unity Catalog

Introduction

In modern data engineering workflows, managing raw data and its transformations is crucial for providing clean, reliable, and structured data to analytics teams and downstream applications. Databricks Unity Catalog provides a unified data governance model that simplifies and secures data access across the entire organization. This guide will walk through a comprehensive process for ingesting raw data into a shared bronze layer and transforming it into multiple silver layers in a Unity Catalog setup, with data stored in catalog-managed tables instead of traditional file paths. We will use examples and data flow diagrams to illustrate the concepts.

Overview of the Layers

Raw Layer: Unstructured or semi-structured data as it arrives from source systems (e.g., logs, IoT feeds, application data).
Bronze Layer: A landing area where raw data is cleaned and minimally processed. This layer maintains data fidelity while addressing obvious data quality issues.
Silver Layer: Curated data for specific use cases. Data from the bronze layer is cleaned, structured, and aggregated into independent tables optimized for downstream use (e.g., analytics, reporting, machine learning).

Data Flow Diagram

+-------------------+          +-------------------+          +-------------------+
|    Raw Layer      |   --->   |   Bronze Layer    |   --->   |  Silver Layers    |
| (External Data)   |          | (Cleaned Data)    |          | (Curated Data)    |
+-------------------+          +-------------------+          +-------------------+

                                        |
                +-----------------------+-----------------------+
                |                       |                       |
     +--------------------+  +--------------------+  +--------------------+
     | Bronze to Silver 1 |  | Bronze to Silver 2 |  | Bronze to Silver n |
     | (Specific Use Case)|  | (Specific Use Case)|  | (Specific Use Case)|
     +--------------------+  +--------------------+  +--------------------+


Step-by-Step Guide

1. Ingesting Data into the Raw Layer

The Raw Layer represents unprocessed data arriving from external sources in formats such as JSON, CSV, or application logs. This data is brought into Databricks through batch or streaming jobs; for efficient incremental file ingestion from cloud storage systems like Azure Data Lake Storage (ADLS), you can use Databricks Auto Loader (a streaming sketch follows the batch example below).

Here’s an example of a batch ingestion job using PySpark:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("RawToBronzeIngestion").getOrCreate()

# Define source and target paths for raw data
raw_data_path = "abfss://raw@storage_account.dfs.core.windows.net/dataset/"
bronze_table = "catalog_name.schema_name.bronze_table"

# Load the raw data
raw_df = spark.read.option("header", True).csv(raw_data_path)

# Write the raw data into the bronze layer
raw_df.write.format("delta").mode("overwrite").saveAsTable(bronze_table)

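If new files keep landing in the raw location, the same ingestion can run incrementally with Databricks Auto Loader, as mentioned above. A minimal sketch; the schema and checkpoint locations below are hypothetical placeholders in the same storage account:

# Incremental file ingestion with Auto Loader (cloudFiles source)
raw_stream_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "abfss://raw@storage_account.dfs.core.windows.net/_schemas/dataset/")  # hypothetical path
    .option("header", True)
    .load(raw_data_path)
)

# Write newly arrived files into the bronze table; availableNow processes the backlog and stops
(
    raw_stream_df.writeStream
    .option("checkpointLocation", "abfss://raw@storage_account.dfs.core.windows.net/_checkpoints/dataset/")  # hypothetical path
    .trigger(availableNow=True)
    .toTable(bronze_table)
)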

Key Points:

Source: Data is ingested from external systems or cloud storage.
Target: The raw data is ingested directly into a Unity Catalog table in the bronze layer.
Table Management: Unity Catalog manages the metadata for the tables, providing security and governance.
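
The examples above use the three-level name catalog_name.schema_name.bronze_table, which assumes the catalog and schema already exist. A minimal setup sketch, reusing the placeholder names from the examples and assuming you have the privileges to create these objects:

# One-time setup of the catalog and schema that hold the bronze table
spark.sql("CREATE CATALOG IF NOT EXISTS catalog_name")
spark.sql("CREATE SCHEMA IF NOT EXISTS catalog_name.schema_name")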

2. Bronze Layer: Initial Cleansing and Normalization

The Bronze Layer is a lightly processed version of the raw data, stored in Unity Catalog tables. Basic cleansing and structural changes (e.g., handling nulls, normalizing data types) are applied here, while the layer still preserves data fidelity and serves as the source of truth and archive of the original data.

Let’s clean the raw data and store it in the bronze table:

from pyspark.sql.functions import col, current_timestamp

# Clean and transform raw data
bronze_df = raw_df \
    .dropDuplicates() \
    .filter(col("important_column").isNotNull()) \
    .withColumn("processed_at", current_timestamp())

# Write the cleaned data into the bronze table
bronze_df.write.format("delta").mode("overwrite").saveAsTable(bronze_table)


Key Steps:

Data Cleansing: Remove duplicates, null values, and other obvious quality issues.
Column Enrichment: Add processing metadata such as timestamps for auditing.
Governance: Data in the bronze layer is managed and versioned with Delta Lake, enhancing traceability (see the snippet below).
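
Because the bronze table is a Delta table, its change history can be inspected and earlier versions queried directly, which is what makes the traceability point above concrete:

# Inspect the Delta transaction log for the bronze table
spark.sql(f"DESCRIBE HISTORY {bronze_table}").show(truncate=False)

# Query an earlier version of the table (Delta time travel)
spark.sql(f"SELECT * FROM {bronze_table} VERSION AS OF 0").show()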

3. Transforming Bronze Data into Multiple Silver Layers

The Silver Layer represents clean, structured, and optimized datasets for specific downstream use cases. These could include separate tables for different business units or functional areas. The goal is to provide ready-to-use data for analytics, machine learning, or reporting.

Each silver table is designed with specific business logic in mind. Below, we walk through two different transformations from the bronze layer into silver layers.

Silver Layer 1: Customer Data Cleansing and Aggregation

For example, if you have customer data, you may want to clean and aggregate customer transactions for reporting purposes.

from pyspark.sql.functions import col, sum as sum_

# Load the bronze data
bronze_df = spark.table(bronze_table)

# Transform and cleanse data for the silver layer
silver_customer_df = bronze_df \
    .filter(col("transaction_type") == "purchase") \
    .groupBy("customer_id") \
    .agg(sum_("transaction_amount").alias("total_purchase"))

# Write the curated data to the silver layer
silver_customer_table = "catalog_name.schema_name.silver_customers"
silver_customer_df.write.format("delta").mode("overwrite").saveAsTable(silver_customer_table)


Silver Layer 2: Product Data Transformation

You might have product data that needs to be structured for inventory tracking:

# Filter and transform product-related data
silver_product_df = bronze_df \
    .filter(col("category").isNotNull()) \
    .select("product_id", "category", "inventory_status")

# Write the product data to another silver table
silver_product_table = "catalog_name.schema_name.silver_products"
silver_product_df.write.format("delta").mode("overwrite").saveAsTable(silver_product_table)


Key Points for Silver Layer:

Business Logic: Each silver table represents a different subset of the bronze data, filtered and transformed to serve specific use cases.
Table Partitioning: Silver tables can be partitioned on frequently queried columns (e.g., date, customer region) to optimize performance, as shown in the sketch after this list.
Storage Optimization: Data stored in the silver layer can be aggregated, cleaned, and formatted for efficient querying.
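
As an illustration of the partitioning point above, the product table from Silver Layer 2 could be written partitioned by its category column. This is only a sketch; whether partitioning helps depends on data volume and query patterns:

# Write the product table partitioned by a frequently filtered column
(
    silver_product_df
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("category")
    .saveAsTable(silver_product_table)
)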

4. Managing Unity Catalog Tables

All data at each stage (raw, bronze, silver) is managed as Unity Catalog tables. This provides centralized control over access permissions, schema evolution, and lineage tracking. You can also enforce data masking or row-level security policies based on user roles.
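
For example, read access to one of the silver tables can be granted to an analyst group with standard Unity Catalog SQL. The data_analysts group name below is a hypothetical placeholder:

# Grant an analyst group access to the catalog, schema, and a silver table
spark.sql("GRANT USE CATALOG ON CATALOG catalog_name TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA catalog_name.schema_name TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE catalog_name.schema_name.silver_customers TO `data_analysts`")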

To view and manage the tables:

-- List all Unity Catalog tables in a specific schema
SHOW TABLES IN catalog_name.schema_name;

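Row-level security, mentioned above, can be implemented by attaching a row filter function to a table. A sketch only: the region column and the admins group are hypothetical placeholders that do not appear in the earlier examples:

# Define a SQL UDF used as a row filter: admins see all rows, others only EMEA rows
spark.sql("""
    CREATE OR REPLACE FUNCTION catalog_name.schema_name.region_filter(region STRING)
    RETURN is_account_group_member('admins') OR region = 'EMEA'
""")

# Attach the filter to a table that has a region column
spark.sql("""
    ALTER TABLE catalog_name.schema_name.silver_products
    SET ROW FILTER catalog_name.schema_name.region_filter ON (region)
""")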

5. Scheduling and Automation of Pipelines

Data pipelines for moving from raw to bronze and from bronze to silver can be automated using Databricks Workflows or Azure Data Factory. This ensures the ingestion and transformation processes are consistent and run on predefined schedules.

Here’s a sample setup of a Databricks job to automate this pipeline:

# Job configuration (a simplified sketch shaped like a Databricks Jobs API 2.1 payload)
job = {
    "name": "BronzeToSilverPipeline",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # daily at 2 AM
        "timezone_id": "UTC"
    },
    "tasks": [
        {
            "task_key": "ingest_raw_data",
            "notebook_task": {"notebook_path": "/Repos/IngestRawData"}
        },
        {
            "task_key": "transform_to_silver",
            "notebook_task": {"notebook_path": "/Repos/TransformToSilver"},
            "depends_on": [{"task_key": "ingest_raw_data"}]
        }
    ]
}


This ensures each pipeline step runs in sequence and that data is consistently updated from raw to bronze to silver layers.

Conclusion

Ingesting and transforming data within Databricks using Unity Catalog offers a scalable and governed approach to managing data pipelines. By adopting this architecture of a shared bronze layer feeding independent silver layers, organizations can ensure data reliability, performance, and security. Leveraging Delta Lake and Unity Catalog features such as access control and auditing supports compliance while keeping the flexibility to transform and prepare data for diverse use cases.

This architecture can be expanded further with additional layers (e.g., gold layer for final analytics) and extended with machine learning models. By managing all data in catalog-based tables, you create a unified environment with centralized governance and control.

Some reference links to help you build data pipelines using Databricks, Delta Lake, PySpark, and Unity Catalog:

Unity Catalog Overview: Unity Catalog Docs
Delta Lake Guide: Delta Lake Docs
Auto Loader for Data Ingestion: Auto Loader Docs
PySpark for Data Transformations: PySpark Docs
Databricks Jobs and Workflows: Jobs Docs
Azure Data Lake Integration: ADLS Docs
