Nithyalakshmi Kamalakkannan

Part 4: Building the Bronze Layer with Auto Loader and Delta Lake

The Bronze layer is the foundation of the entire streaming architecture. Its role is to ingest data exactly as it arrives and store it durably, optionally enriched with metadata such as a timestamp recording when each event arrived.
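
Such arrival metadata can be captured with a couple of extra columns. Here is a minimal sketch, assuming raw_stream stands in for the Auto Loader stream defined later in this post; the column names _ingested_at and _source_file are illustrative, and the hidden _metadata column requires a reasonably recent Spark/Databricks runtime:

from pyspark.sql import functions as F

# Illustrative enrichment: stamp each record with ingestion metadata.
# raw_stream stands in for the Auto Loader stream built below.
enriched_stream = (
    raw_stream
    .withColumn("_ingested_at", F.current_timestamp())          # when the row was processed
    .withColumn("_source_file", F.col("_metadata.file_path"))   # which file it came from
)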

Creating schemas and volumes

We create the catalog, the bronze schema, and the volumes required to store Auto Loader metadata and streaming checkpoints in Databricks.

%sql
CREATE CATALOG IF NOT EXISTS nyc_taxi;
CREATE SCHEMA IF NOT EXISTS nyc_taxi.bronze;  -- Bronze tables live here
CREATE SCHEMA IF NOT EXISTS nyc_taxi.infra;   -- infrastructure objects
CREATE VOLUME IF NOT EXISTS nyc_taxi.infra.autoloader;   -- Auto Loader schema metadata
CREATE VOLUME IF NOT EXISTS nyc_taxi.infra.checkpoints;  -- streaming checkpoints
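
As a quick optional sanity check, the created objects can be listed from a Python cell before wiring up the stream:

# Confirm the schemas and volumes exist.
spark.sql("SHOW SCHEMAS IN nyc_taxi").show()
spark.sql("SHOW VOLUMES IN nyc_taxi.infra").show()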

Using Auto Loader for Bronze ingestion

Databricks Auto Loader is purpose-built for scalable file-based ingestion.
In this project, Auto Loader continuously watches the directory created in Part 2 and processes only newly arrived files.

We start by defining a streaming DataFrame using Auto Loader:

bronze_stream = (
    spark.readStream
    .format("cloudFiles")  # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaLocation", "/Volumes/nyc_taxi/infra/autoloader/metadata")
    .load("/tmp/taxi_stream_input")
)

cloudFiles.format - Specifies the underlying file format (json in this case).

cloudFiles.inferColumnTypes - Automatically infers data types instead of defaulting to strings.

cloudFiles.schemaLocation - Stores inferred schemas and supports schema evolution across runs.
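
Evolution behavior can also be pinned down explicitly with the cloudFiles.schemaEvolutionMode option. A sketch using the same paths as above (addNewColumns is the default when a schema location is set):

# With addNewColumns, the stream fails once when new columns appear,
# records them in the schema location, and picks them up on restart.
evolving_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/nyc_taxi/infra/autoloader/metadata")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/tmp/taxi_stream_input")
)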

Writing to Bronze Delta Tables

Next, we write the streaming data into a Bronze Delta table.

(
    bronze_stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/bronze/taxi_trips")
    .trigger(availableNow=True)
    .toTable("nyc_taxi.bronze.taxi_trips")
)

Using Delta tables in the Bronze layer provides:

  • ACID guarantees on streaming writes
  • Schema enforcement and evolution
  • Reliable checkpointing
  • Compatibility with batch and streaming reads (shown below)
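
The last point is easy to demonstrate: the same table serves batch queries and downstream streams. A small sketch:

# Batch read: query the Bronze table like any other Delta table.
batch_df = spark.table("nyc_taxi.bronze.taxi_trips")
print(batch_df.count())

# Streaming read: the same table becomes the source for the Silver layer.
silver_input = spark.readStream.table("nyc_taxi.bronze.taxi_trips")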

This pipeline uses .trigger(availableNow=True), which processes all available files and then stops automatically. The job can be scheduled to run periodically, which reduces cost compared to an always-on stream.

In practice, this behaves like incremental batch processing — ideal for cloud storage–based ingestion.
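
For comparison, an always-on variant would only swap the trigger; everything else stays the same. A sketch, not what this pipeline uses:

# Always-on alternative: micro-batches every minute instead of run-and-stop.
# Lower latency, but the cluster has to stay up.
(
    bronze_stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/bronze/taxi_trips")
    .trigger(processingTime="1 minute")
    .toTable("nyc_taxi.bronze.taxi_trips")
)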

The raw data is now ingested into Bronze Delta tables using Auto Loader and is ready to be refined further in the Silver layer.

Happy learning!
