The days of simple batch processing and static dashboards are behind us!
Modern data platforms must ingest continuously arriving data, gracefully handle late and out-of-order events, scale efficiently, and still deliver reliable, business-ready metrics in near real time.
In this blog series, we will explore how to build an end-to-end, real-time streaming data platform on Databricks.
As a newcomer to streaming systems, I have applied what I have learned about Spark Structured Streaming, Delta Lake, Auto Loader, and the Medallion Architecture to design and implement this solution.
This is a small, hands-on data engineering project to gain practical experience on the Databricks platform, using the sample NYC Taxi Trips dataset. The intention is to start with something I can experiment with and use to apply what I have read in theory.
The project ingests data from file storage into Bronze Delta tables using Auto Loader, reads from Bronze with Spark Structured Streaming, cleanses and normalizes the data into Silver Delta tables, and applies aggregations to produce Gold Delta tables. The pipeline is orchestrated with Databricks Workflows, and insights are visualized through dashboards built on queries against the Gold layer. A sketch of the Bronze-to-Silver flow is shown below.
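To make the flow concrete, here is a minimal sketch of the Bronze-to-Silver portion of the pipeline. It assumes a Databricks notebook where `spark` is already defined; the storage paths, schema and checkpoint locations, and table names (such as `taxi_bronze` and `taxi_silver`) are illustrative placeholders rather than the exact ones used in the project.

```python
# Minimal sketch: Auto Loader -> Bronze, then a streaming cleanse into Silver.
# Assumes a Databricks notebook (spark is predefined); paths/table names are placeholders.

from pyspark.sql import functions as F

# Bronze: ingest raw NYC Taxi trip files incrementally with Auto Loader.
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/Volumes/demo/nyc_taxi/_schemas/bronze")
    .load("/Volumes/demo/nyc_taxi/raw/")
)

(bronze_stream.writeStream
    .option("checkpointLocation", "/Volumes/demo/nyc_taxi/_checkpoints/bronze")
    .trigger(availableNow=True)          # process all new files, then stop
    .toTable("demo.nyc_taxi.taxi_bronze"))

# Silver: read the Bronze table as a stream, then cleanse and normalize.
silver_stream = (
    spark.readStream.table("demo.nyc_taxi.taxi_bronze")
    .filter(F.col("trip_distance") > 0)                          # drop obviously bad records
    .withColumn("pickup_ts", F.to_timestamp("tpep_pickup_datetime"))
)

(silver_stream.writeStream
    .option("checkpointLocation", "/Volumes/demo/nyc_taxi/_checkpoints/silver")
    .trigger(availableNow=True)
    .toTable("demo.nyc_taxi.taxi_silver"))
```

The `availableNow` trigger processes whatever new files have arrived and then stops, which fits a pipeline scheduled through Databricks Workflows; switching to a continuous or processing-time trigger turns the same code into an always-on stream.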
I primarily used Databricks serverless compute, so I did not explicitly create or manage clusters. Feel free to create your own clusters and run the same Spark workloads to gain deeper insight into execution behavior, resource utilization, and performance characteristics through the Spark UI.
The source code Git repo is linked in the last post of this series. Keep scrolling, and your feedback is most welcome.
Happy learning!!