This month I attended the risingwave workshop hosted by datatalks for the 2024 cohort of the zoomcamp data engineering course.
Risingwave is an open-source, cloud-native streaming database system designed for real-time analytics and event-driven applications. With risingwave you can handle streaming data using sql-like queries, making it very powerful for scenarios like; real-time monitoring, anomaly detection, event-driven architectures, and continuous data transformation.
As a quick demo of how sql-like risingwave is, lets create a materialized view to compute the average, min and max trip time between taxi zone as well as find the top 10 maximum trip time for a route pair taking into consideration trip direction (i.e trip from a to b is different from trip from b to a).Data we are working with is the New York openly available yellow taxi data.
CREATE MATERIALIZED VIEW mv_avg_min_max_trip_time AS
SELECT
pu_zone.Zone AS pickup_zone,
do_zone.Zone AS dropoff_zone,
AVG(EXTRACT(EPOCH FROM (trip_data.tpep_dropoff_datetime - trip_data.tpep_pickup_datetime)) / 60) AS avg_trip_duration,
MIN(EXTRACT(EPOCH FROM (trip_data.tpep_dropoff_datetime - trip_data.tpep_pickup_datetime)) / 60) AS min_trip_duration,
MAX(EXTRACT(EPOCH FROM (trip_data.tpep_dropoff_datetime - trip_data.tpep_pickup_datetime)) / 60) AS max_trip_duration
FROM
trip_data
JOIN
taxi_zone pu_zone ON trip_data.PULocationID = pu_zone.location_id
JOIN
taxi_zone do_zone ON trip_data.DOLocationID = do_zone.location_id
GROUP BY
pu_zone.Zone,
do_zone.Zone;
SELECT
pickup_zone,
dropoff_zone,
max_trip_duration
FROM
mv_avg_min_max_trip_time
ORDER BY
max_trip_duration DESC
LIMIT 10;
Sources
datatalks risingwave workshop link
https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04/blob/main/workshop.md
Data talks data engineering course repo (week 6)
https://github.com/DataTalksClub/data-engineering-zoomcamp
Top comments (0)