
Abdelrahman Adnan

Part 7 - Spark Transform Local vs Cloud ⚡


This part continues from the API client layer and explains the transformation job in spark_jobs/air_quality_to_parquet.py.

What the Spark job does

The job reads raw OpenAQ and weather JSON, flattens nested structures, joins the datasets, and writes parquet into a staging layer partitioned by time.

That is the classic lakehouse-style move from raw JSON to structured analytics data.

Local versus cloud execution

The job can run in two different environments:

  • locally with SparkSession.builder.master("local[*]"),
  • or in the cloud through EMR Serverless.

The path resolution logic in resolve_paths() is what makes that possible. In local mode it reads and writes from the filesystem. In cloud mode it uses the bucket name pulled from SSM and points Spark to S3 locations instead.

Flattening the raw payloads

The Spark code expands nested arrays and structs to create a row-per-reading structure. The important pieces are:

  • air quality readings are exploded from the results array,
  • station metadata is exploded from the station sample,
  • weather fields are selected from the current conditions payload,
  • and the two datasets are joined on station and hour.

That join is where the project starts to look like an analytics pipeline instead of a raw ingestion job.

Schema stability

The job explicitly casts columns into stable types before writing parquet. That protects downstream consumers from schema drift and helps the warehouse load stay predictable.

This is a very useful lesson: in data engineering, the output contract often matters more than the implementation details behind it.

Why partitioning matters

The final write uses partitionBy("year", "month", "day", "hour"). That keeps the staging layer aligned with the raw layer and makes time-based reads efficient.

Continue

The next part explains how the staging parquet lands in PostgreSQL and how the pipeline keeps the warehouse tables available for dbt and Superset.

Continue to Part 8: Staging Load into Postgres.

Tag: #dataengineeringzoomcamp
