We use the Databricks NYC Taxi sample dataset, available by default in Databricks.
This dataset is ideal because it includes:
- Event timestamps (tpep_pickup_datetime)
- Numeric measures (fare_amount, trip_distance)
- Location attributes (pickup_zip, dropoff_zip)
- Sufficient data volume to observe performance and shuffle behavior
Although the dataset is static, we will convert it into a streaming source.
Converting Static Data into a Streaming Source
Step 1: Read the Sample Dataset
df = spark.table("samples.nyctaxi.trips")
At this point, the data is a normal batch DataFrame.
Step 2: Write Data as JSON Files
To simulate streaming input, we write the dataset as JSON files to a directory:
(
    df.write
        .mode("overwrite")
        .format("json")
        .save("/tmp/taxi_stream_input")
)
This writes the files to DBFS (Databricks File System, think of it as a virtual storage layer provided by Databricks), overwriting anything already present in "/tmp/taxi_stream_input". The write produces multiple JSON files, and each file represents a batch of incoming events.
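To see why a directory of files works as streaming input, here is a small Spark-free sketch. The directory and the sample records are invented for illustration: each file holds newline-delimited JSON records, the same shape Spark's JSON writer produces, and each file stands in for one batch of events.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the input directory used above.
input_dir = tempfile.mkdtemp(prefix="taxi_stream_input_")

# Two invented batches of taxi-like events.
batches = [
    [{"fare_amount": 12.5, "trip_distance": 2.1, "pickup_zip": 10001}],
    [{"fare_amount": 7.0, "trip_distance": 0.9, "pickup_zip": 10007},
     {"fare_amount": 31.0, "trip_distance": 9.4, "pickup_zip": 11371}],
]

# Write each batch as its own file of newline-delimited JSON records,
# mimicking the multiple part files Spark produces.
for i, batch in enumerate(batches):
    with open(os.path.join(input_dir, f"part-{i:05d}.json"), "w") as f:
        for record in batch:
            f.write(json.dumps(record) + "\n")

# A file-based streaming source simply picks these files up one at a time.
files = sorted(os.listdir(input_dir))
print(files)  # one file per batch
```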
Now the data is available as files in storage, ready for us to read and start streaming!
Happy learning!

