
Nithyalakshmi Kamalakkannan


Part 3: Simulating Real-Time Streaming Data Using Databricks Sample Datasets

We use the Databricks NYC Taxi sample dataset, available by default in Databricks.

This dataset is ideal because it includes:

  • Event timestamps (tpep_pickup_datetime)
  • Numeric measures (fare_amount, trip_distance)
  • Location attributes (pickup_zip, dropoff_zip)
  • Sufficient data volume to observe performance and shuffle behavior

Although the dataset is static, we will convert it into a streaming source.

Converting Static Data into a Streaming Source

Step 1: Read the Sample Dataset

df = spark.table("samples.nyctaxi.trips")

At this point, the data is a normal batch DataFrame.

Step 2: Write Data as JSON Files

To simulate streaming input, we write the dataset as JSON files to a directory:

(
    df.write
      .mode("overwrite")
      .format("json")
      .save("/tmp/taxi_stream_input")
)

This writes the dataset to DBFS (Databricks File System, a virtual file system provided by Databricks), overwriting anything previously present at "/tmp/taxi_stream_input". Spark produces multiple JSON files, one per partition, and each file can be treated as a batch of incoming events.
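To build intuition for why a directory of files can stand in for a stream, here is a small plain-Python sketch (no Spark required; the event values are made up for illustration). Each JSON Lines file in the directory plays the role of one micro-batch of events:

```python
import json
import os
import tempfile

# Hypothetical events, shaped like rows of the taxi dataset.
batches = [
    [{"fare_amount": 12.5, "trip_distance": 3.1}],   # batch 1: one event
    [{"fare_amount": 7.0, "trip_distance": 1.2},
     {"fare_amount": 22.4, "trip_distance": 8.9}],   # batch 2: two events
]

input_dir = tempfile.mkdtemp()

# Drop each batch into the directory as its own JSON Lines file,
# mimicking how Spark's JSON writer emits one file per partition.
for i, batch in enumerate(batches):
    with open(os.path.join(input_dir, f"part-{i}.json"), "w") as f:
        for event in batch:
            f.write(json.dumps(event) + "\n")

# A file-based streaming source simply picks up new files as they appear;
# here we just count the events that have "arrived" so far.
total = sum(
    1
    for name in os.listdir(input_dir)
    for _ in open(os.path.join(input_dir, name))
)
print(total)  # 3
```

A real file stream does the same thing continuously: every new file dropped into the directory becomes a new micro-batch.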

Now the data is available as files on storage, ready to be read as a streaming source!

Happy learning!
