Nithyalakshmi Kamalakkannan
Part 2: Project Architecture

The goal is not just to “make streaming work”, but to design a maintainable and observable streaming platform.

At a high level, the platform follows a Medallion Architecture, which organizes data into progressive layers of refinement:

  • Bronze: Raw, append-only streaming ingestion
  • Silver: Cleaned, enriched, normalized data
  • Gold: Aggregated, business-ready metrics

Architectural flow

This project implements an end-to-end real-time data pipeline built on Databricks, following the Medallion Architecture pattern. Each stage progressively refines data from raw events into business-ready insights.

Databricks Sample Data

At the top of the pipeline, Databricks-provided sample datasets (in this case, NYC Taxi trip data) act as the data source. These datasets contain realistic event timestamps, numeric measures, and location attributes, making them suitable for simulating real-world streaming use cases without requiring external systems.
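As a sketch, loading the sample data might look like the function below. The dataset path is an assumption based on the standard `/databricks-datasets/` mount; list that directory in your workspace to confirm the exact location.

```python
def load_sample_trips(spark, path="/databricks-datasets/nyctaxi/tripdata/yellow/"):
    """Read the NYC Taxi sample data shipped with Databricks.

    `path` is illustrative; verify it against your workspace's
    /databricks-datasets/ listing before running.
    """
    return (
        spark.read
        .option("header", "true")       # sample files include a header row
        .option("inferSchema", "true")  # let Spark derive column types
        .csv(path)
    )
```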

Simulated Streaming Input

Because the sample data is static by default, it is first written incrementally as files into cloud storage (DBFS). This step simulates real-time data arrival, mimicking how production systems often receive data from upstream applications, IoT devices, or operational databases via files landing in object storage.

New files arriving in this directory represent new streaming events.

Auto Loader

Databricks Auto Loader continuously monitors the input directory, efficiently detects newly arrived files, and provides schema inference and evolution.
It integrates natively with Spark Structured Streaming, allowing file-based ingestion to behave like a true streaming source.
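A minimal Auto Loader read might look like this. The directory paths are placeholders; `cloudFiles.format` and `cloudFiles.schemaLocation` are the standard Auto Loader options for the file format and the schema-tracking location.

```python
def read_autoloader_stream(spark, input_dir, schema_dir):
    """Incrementally ingest new files from `input_dir` with Auto Loader.

    `input_dir` and `schema_dir` are placeholder paths.
    """
    return (
        spark.readStream
        .format("cloudFiles")                              # Auto Loader source
        .option("cloudFiles.format", "csv")                # format of the landing files
        .option("cloudFiles.schemaLocation", schema_dir)   # schema inference + evolution state
        .option("header", "true")
        .load(input_dir)
    )
```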

Bronze Delta Tables (Raw Layer)

The Bronze layer stores raw, append-only data exactly as it arrives from the source, with minimal transformation.
This layer ensures that raw data is always preserved, enabling replay, debugging, and full reprocessing if needed.
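An append-only Bronze write can be sketched as below. The table name, checkpoint path, and the `_ingested_at` audit column are illustrative choices for this sketch, not fixed conventions.

```python
def write_bronze(stream_df, table_name="bronze_taxi_trips",
                 checkpoint_dir="/tmp/checkpoints/bronze"):
    """Append the raw stream to a Bronze Delta table with minimal transformation."""
    from pyspark.sql import functions as F  # deferred so the sketch imports without Spark
    return (
        stream_df
        .withColumn("_ingested_at", F.current_timestamp())  # ingestion audit column
        .writeStream
        .format("delta")
        .outputMode("append")                                # Bronze is append-only
        .option("checkpointLocation", checkpoint_dir)        # enables exactly-once recovery
        .toTable(table_name)
    )
```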

Silver Delta Tables (Cleaned & Enriched Layer)

In the Silver layer, data is cleansed, standardized, and enriched. Typical steps include:

  • Date type normalization
  • Filtering invalid or malformed records
  • Joining with dimension tables (for example, ZIP code to region mappings)

Silver tables represent trusted, analytics-ready data that can be reused across multiple downstream use cases.
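The Silver transformations above can be sketched as one function. The column names (`pickup_datetime`, `fare_amount`, `pickup_zip`) are assumptions about the Bronze schema and should be adjusted to match your actual data.

```python
def build_silver(bronze_df, zones_df):
    """Cleanse and enrich Bronze data: normalize timestamps, drop malformed
    rows, and join a ZIP-to-region dimension table.

    Column names are illustrative and must match your Bronze schema.
    """
    from pyspark.sql import functions as F  # deferred so the sketch imports without Spark
    return (
        bronze_df
        .withColumn("pickup_ts", F.to_timestamp("pickup_datetime"))  # date normalization
        .filter(F.col("pickup_ts").isNotNull())                      # drop unparseable rows
        .filter(F.col("fare_amount") > 0)                            # drop invalid records
        .join(zones_df, on="pickup_zip", how="left")                 # enrich with region
    )
```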

Gold Delta Tables (Business Layer)

The Gold layer contains aggregated, business-focused datasets designed for analytics and reporting. For example,

  • Hourly trip counts by region
  • Revenue metrics

This layer often uses event-time processing, windowed aggregations, and watermarking to handle late-arriving data while keeping state bounded.
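A Gold aggregation combining these techniques might look like this sketch: a watermark on the event-time column bounds state, and a one-hour window groups trips by region. The 30-minute lateness tolerance and the column names (`pickup_ts`, `region`, `fare_amount`) are illustrative assumptions.

```python
def build_gold(silver_df):
    """Hourly trip counts and revenue by region, using event time with a
    watermark so late data is handled while state stays bounded."""
    from pyspark.sql import functions as F  # deferred so the sketch imports without Spark
    return (
        silver_df
        .withWatermark("pickup_ts", "30 minutes")        # tolerate 30 min of lateness
        .groupBy(F.window("pickup_ts", "1 hour"), "region")
        .agg(
            F.count("*").alias("trip_count"),
            F.sum("fare_amount").alias("total_revenue"),
        )
    )
```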

Databricks SQL Dashboards

Finally, Gold tables are consumed by Databricks SQL Dashboards. As new data flows through the pipeline, dashboards update automatically, closing the loop from raw events to actionable insights.

Together, these components form a robust, scalable, and maintainable real-time data platform.

Happy learning!
