Pipeline Creation
A Databricks Workflow is created with one task for each stage discussed in this blog series. The entire pipeline is orchestrated to stream and process data incrementally:
- Bronze ingestion
- ZIP dimension build
- Silver enrichment
- Gold aggregation (both tables)
Task dependencies enforce the execution order automatically. You can also schedule the pipeline as needed with simple cron expressions!
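If you prefer code over the UI, the same workflow can be sketched with the Databricks Python SDK. This is a minimal sketch, not the exact series setup: the notebook paths, job name, and cron expression are hypothetical placeholders, and serverless jobs are assumed (otherwise attach a cluster to each task).

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

def task(key, path, depends_on=None):
    # Helper to declare one notebook task; depends_on wires the execution order
    return jobs.Task(
        task_key=key,
        notebook_task=jobs.NotebookTask(notebook_path=path),  # hypothetical paths
        depends_on=[jobs.TaskDependency(task_key=d) for d in (depends_on or [])],
    )

w.jobs.create(
    name="nyc-taxi-pipeline",  # hypothetical job name
    tasks=[
        task("bronze_ingestion", "/pipelines/01_bronze_ingestion"),
        task("zip_dimension", "/pipelines/02_zip_dimension", ["bronze_ingestion"]),
        task("silver_enrichment", "/pipelines/03_silver_enrichment", ["zip_dimension"]),
        task("gold_aggregation", "/pipelines/04_gold_aggregation", ["silver_enrichment"]),
    ],
    # Optional: run daily at 6 AM UTC (Quartz cron syntax)
    schedule=jobs.CronSchedule(quartz_cron_expression="0 0 6 * * ?", timezone_id="UTC"),
)
```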
Dashboard Creation
Queries on the Gold tables feed data to Databricks dashboards.
In the Databricks workspace, create your own dashboard and add custom queries to visualize business insights.
For example, to find the peak hours, we add the query below under Data (from SQL) and create a tile in our dashboard to show the fetched results.
```sql
SELECT
  trip_hour,
  SUM(total_trips) AS trips
FROM nyc_taxi.gold.taxi_trip_metrics
GROUP BY trip_hour
ORDER BY trips DESC  -- busiest hours first
```
And the result is a tile charting trips per hour.
You can keep adding tiles to beautify your dashboard!
Dashboards update automatically when:
- New files arrive
- Jobs rerun
- Late data is processed (within the watermark)
To simulate new data arrival, we can add extra files to the DBFS input source, as sketched below.
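For instance, a small sketch of dropping one more batch file into the input folder (both paths are hypothetical):

```python
# Copy an extra batch into the ingestion folder (hypothetical paths);
# the bronze task will pick it up incrementally on its next run.
dbutils.fs.cp(
    "dbfs:/FileStore/staging/yellow_tripdata_extra.parquet",
    "dbfs:/FileStore/nyc_taxi/input/yellow_tripdata_extra.parquet",
)
```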
You can play with tpep_pickup_datetime to see watermarks dropping late data in action.
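As a reminder of the mechanics, here is a minimal sketch of the watermark, not the exact series code; the bronze table name and the 30-minute threshold are assumptions:

```python
from pyspark.sql import functions as F

# Rows whose tpep_pickup_datetime lags more than 30 minutes behind the
# latest event time seen by the stream are considered too late and dropped.
hourly_counts = (
    spark.readStream.table("nyc_taxi.bronze.taxi_trips")  # assumed table name
    .withWatermark("tpep_pickup_datetime", "30 minutes")
    .groupBy(F.window("tpep_pickup_datetime", "1 hour"))
    .agg(F.count("*").alias("total_trips"))
)
```

Rewind a record's tpep_pickup_datetime past that threshold and it silently disappears from the aggregation.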
Messed up somewhere, or want to reset the state?
Reprocessing Strategy
To reprocess everything (see the sketch after this list):
- Drop tables or schema
- Delete checkpoints
- Rerun workflow
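A minimal reset sketch: the schema names follow this series (gold is the one queried above; bronze and silver are assumed), and the checkpoint path is a hypothetical placeholder.

```python
# Drop all pipeline tables by removing the schemas
for schema in ["bronze", "silver", "gold"]:
    spark.sql(f"DROP SCHEMA IF EXISTS nyc_taxi.{schema} CASCADE")

# Delete the streaming checkpoints so state is rebuilt from scratch
dbutils.fs.rm("dbfs:/FileStore/nyc_taxi/checkpoints", True)  # hypothetical path

# Then rerun the workflow from the Jobs UI (or trigger it via the Jobs API)
```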
Hope you liked the series; please do share your feedback.
The source code is available in the GitHub repository for reference.
That's all for now. See you soon!
Happy learning!