Pipeline Creation
A Databricks Workflow is created with one task for each stage discussed in this blog series. The entire pipeline is orchestrated to stream and process data incrementally:
- Bronze ingestion
- ZIP dimension build
- Silver enrichment
- Gold aggregation (both tables)
Task dependencies enforce the execution order automatically. You can also schedule the pipeline as needed with simple cron expressions!
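If you prefer code over the UI, the same workflow can be sketched with the Databricks Python SDK. This is a minimal sketch, not the exact series setup: the notebook paths, job name, and cron expression are hypothetical placeholders, and serverless jobs are assumed (otherwise attach a cluster to each task).

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

def task(key, path, depends_on=None):
    # Helper to declare one notebook task; depends_on wires the execution order
    return jobs.Task(
        task_key=key,
        notebook_task=jobs.NotebookTask(notebook_path=path),  # hypothetical paths
        depends_on=[jobs.TaskDependency(task_key=d) for d in (depends_on or [])],
    )

w.jobs.create(
    name="nyc-taxi-pipeline",  # hypothetical job name
    tasks=[
        task("bronze_ingestion", "/pipelines/01_bronze_ingestion"),
        task("zip_dimension", "/pipelines/02_zip_dimension", ["bronze_ingestion"]),
        task("silver_enrichment", "/pipelines/03_silver_enrichment", ["zip_dimension"]),
        task("gold_aggregation", "/pipelines/04_gold_aggregation", ["silver_enrichment"]),
    ],
    # Optional: run daily at 6 AM UTC (Quartz cron syntax)
    schedule=jobs.CronSchedule(quartz_cron_expression="0 0 6 * * ?", timezone_id="UTC"),
)
```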
Dashboard Creation
Queries on the Gold tables feed data to Databricks dashboards.
In the Databricks workspace, create your own dashboard and add custom queries to visualize business insights.
For example, to find the peak hours, we add the query below under Data (from SQL) and create a tile in our dashboard to show the fetched results.
```sql
SELECT
  trip_hour,
  SUM(total_trips) AS trips
FROM nyc_taxi.gold.taxi_trip_metrics
GROUP BY trip_hour
ORDER BY trips DESC  -- busiest hours first
```
And the result is a tile charting trips per hour.
You can keep adding tiles to beautify your dashboard!
Dashboards update automatically when:
- New files arrive
- Jobs rerun
- Late data is processed (within the watermark)
To simulate new data arrival, we can add extra files to the DBFS input source, as sketched below.
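For instance, a small sketch of dropping one more batch file into the input folder (both paths are hypothetical):

```python
# Copy an extra batch into the ingestion folder (hypothetical paths);
# the bronze task will pick it up incrementally on its next run.
dbutils.fs.cp(
    "dbfs:/FileStore/staging/yellow_tripdata_extra.parquet",
    "dbfs:/FileStore/nyc_taxi/input/yellow_tripdata_extra.parquet",
)
```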
You can play with tpep_pickup_datetime to see watermarks dropping late data in action.
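As a reminder of the mechanics, here is a minimal sketch of the watermark, not the exact series code; the bronze table name and the 30-minute threshold are assumptions:

```python
from pyspark.sql import functions as F

# Rows whose tpep_pickup_datetime lags more than 30 minutes behind the
# latest event time seen by the stream are considered too late and dropped.
hourly_counts = (
    spark.readStream.table("nyc_taxi.bronze.taxi_trips")  # assumed table name
    .withWatermark("tpep_pickup_datetime", "30 minutes")
    .groupBy(F.window("tpep_pickup_datetime", "1 hour"))
    .agg(F.count("*").alias("total_trips"))
)
```

Rewind a record's tpep_pickup_datetime past that threshold and it silently disappears from the aggregation.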
Messed up somewhere, or want to reset the state?
Reprocessing Strategy
To reprocess everything (see the sketch after this list):
- Drop tables or schema
- Delete checkpoints
- Rerun workflow
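A minimal reset sketch: the schema names follow this series (gold is the one queried above; bronze and silver are assumed), and the checkpoint path is a hypothetical placeholder.

```python
# Drop all pipeline tables by removing the schemas
for schema in ["bronze", "silver", "gold"]:
    spark.sql(f"DROP SCHEMA IF EXISTS nyc_taxi.{schema} CASCADE")

# Delete the streaming checkpoints so state is rebuilt from scratch
dbutils.fs.rm("dbfs:/FileStore/nyc_taxi/checkpoints", True)  # hypothetical path

# Then rerun the workflow from the Jobs UI (or trigger it via the Jobs API)
```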
Hope you liked the series; please do share your feedback.
The source code is available in the GitHub repository for reference.
That's all for now. See you soon!
Happy learning!