<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abdelrahman Adnan</title>
    <description>The latest articles on DEV Community by Abdelrahman Adnan (@abdelrahman_adnan).</description>
    <link>https://dev.to/abdelrahman_adnan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3198006%2F5e4cd24f-14d6-40e8-82a3-bdf5e6cb60cd.png</url>
      <title>DEV Community: Abdelrahman Adnan</title>
      <link>https://dev.to/abdelrahman_adnan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/abdelrahman_adnan"/>
    <language>en</language>
    <item>
      <title>Part 3: Testing, Documentation &amp; Deployment 🚀</title>
      <dc:creator>Abdelrahman Adnan</dc:creator>
      <pubDate>Mon, 16 Feb 2026 22:46:01 +0000</pubDate>
      <link>https://dev.to/abdelrahman_adnan/part-3-testing-documentation-deployment-1bek</link>
      <guid>https://dev.to/abdelrahman_adnan/part-3-testing-documentation-deployment-1bek</guid>
      <description>&lt;h1&gt;
  
  
  #DataEngineeringZoomcamp #dbt #AnalyticsEngineering #DataModeling
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Macros - Reusable SQL Functions 🔧
&lt;/h3&gt;

&lt;p&gt;Macros are like functions in Python - write once, use everywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Use Macros?
&lt;/h3&gt;

&lt;p&gt;Without macros, you repeat code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- ❌ Repeated everywhere&lt;/span&gt;
&lt;span class="k"&gt;CASE&lt;/span&gt; 
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;payment_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Credit card'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;payment_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Cash'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;payment_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'No charge'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;payment_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Dispute'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;payment_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Unknown'&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'Unknown'&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;payment_type_description&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With macros, write it once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- macros/get_payment_type_description.sql&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;get_payment_type_description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;payment_type&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Credit card'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Cash'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'No charge'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Dispute'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Unknown'&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'Unknown'&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use it in any model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- models/staging/stg_green_tripdata.sql&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="n"&gt;payment_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;get_payment_type_description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'payment_type'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;payment_type_description&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'staging'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'green_tripdata'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Jinja Templating
&lt;/h3&gt;

&lt;p&gt;dbt uses &lt;strong&gt;Jinja&lt;/strong&gt; - a Python templating language. You'll recognize it by &lt;code&gt;{{ }}&lt;/code&gt; and &lt;code&gt;{% %}&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Syntax&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{{ }}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Output expression&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{{ ref('my_model') }}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{% %}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Logic/control flow&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{% if is_incremental() %}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{# #}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Comments&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{# This is a comment #}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  dbt Packages - Community Libraries 📦
&lt;/h3&gt;

&lt;p&gt;Packages let you use macros and models built by others.&lt;/p&gt;

&lt;h3&gt;
  
  
  Popular Packages
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Package&lt;/th&gt;
&lt;th&gt;What it Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;dbt_utils&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Common SQL helpers (surrogate keys, pivot, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;dbt_codegen&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto-generate YAML and SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;dbt_expectations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Great Expectations-style tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;dbt_audit_helper&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compare model outputs when refactoring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Installing Packages
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create &lt;code&gt;packages.yml&lt;/code&gt;:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;package&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt-labs/dbt_utils&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.1.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Run &lt;code&gt;dbt deps&lt;/code&gt;:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt deps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use the macros:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Using dbt_utils to generate surrogate keys&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;dbt_utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate_surrogate_key&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;'vendorid'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'pickup_datetime'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;trip_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'staging'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'green_tripdata'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Testing in dbt 🧪
&lt;/h3&gt;

&lt;p&gt;Tests ensure your data meets expectations. dbt has several test types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Generic Tests (Most Common)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Built-in tests you apply in YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/staging/schema.yml&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stg_green_tripdata&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trip_id&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;unique&lt;/span&gt;       &lt;span class="c1"&gt;# No duplicate values&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;     &lt;span class="c1"&gt;# No null values&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment_type&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;accepted_values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;3&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;4&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;5&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;6&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Only these values allowed&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pickup_location_id&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;relationships&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Referential integrity&lt;/span&gt;
              &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ref('dim_zones')&lt;/span&gt;
              &lt;span class="na"&gt;field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;location_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The four built-in tests:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;What it Checks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unique&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No duplicate values in column&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;not_null&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No NULL values in column&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;accepted_values&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Values must be in specified list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;relationships&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Values must exist in another table&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;2. Singular Tests&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Custom SQL tests in the &lt;code&gt;tests/&lt;/code&gt; folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- tests/assert_positive_fare_amount.sql&lt;/span&gt;
&lt;span class="c1"&gt;-- Test FAILS if any rows are returned&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="n"&gt;trip_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fare_amount&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'fct_trips'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;fare_amount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;-- Find negative fares (bad data!)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Source Freshness Tests&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Check if your source data is up to date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
    &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;green_tripdata&lt;/span&gt;
        &lt;span class="na"&gt;freshness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;warn_after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;24&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;hour&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;error_after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;48&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;hour&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;loaded_at_field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pickup_datetime&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Running Tests
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run all tests&lt;/span&gt;
dbt &lt;span class="nb"&gt;test&lt;/span&gt;

&lt;span class="c"&gt;# Run tests for specific model&lt;/span&gt;
dbt &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--select&lt;/span&gt; stg_green_tripdata

&lt;span class="c"&gt;# Run tests and models together&lt;/span&gt;
dbt build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Documentation 📝
&lt;/h3&gt;

&lt;p&gt;dbt generates beautiful documentation automatically!&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding Descriptions
&lt;/h3&gt;

&lt;p&gt;In your schema YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fct_trips&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="s"&gt;Fact table containing all taxi trips (yellow and green).&lt;/span&gt;
      &lt;span class="s"&gt;One row per trip with fare details and zone information.&lt;/span&gt;

    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trip_id&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Unique identifier for each trip (surrogate key)&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_type&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Type of taxi service - 'Yellow' or 'Green'&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;total_amount&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Total trip cost including fare, tips, taxes, and fees&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Generating Docs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate documentation&lt;/span&gt;
dbt docs generate

&lt;span class="c"&gt;# Serve locally (opens browser)&lt;/span&gt;
dbt docs serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an interactive website with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model descriptions&lt;/li&gt;
&lt;li&gt;Column definitions&lt;/li&gt;
&lt;li&gt;Dependency graph (visual DAG)&lt;/li&gt;
&lt;li&gt;Source information&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Essential dbt Commands 💻
&lt;/h3&gt;

&lt;h3&gt;
  
  
  The Big Four
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dbt run&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Build all models (create views/tables)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dbt test&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run all tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dbt build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run + test together (recommended!)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dbt compile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generate SQL without executing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Other Useful Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check connection&lt;/span&gt;
dbt debug

&lt;span class="c"&gt;# Load seed files&lt;/span&gt;
dbt seed

&lt;span class="c"&gt;# Install packages&lt;/span&gt;
dbt deps

&lt;span class="c"&gt;# Generate docs&lt;/span&gt;
dbt docs generate

&lt;span class="c"&gt;# Retry failed models&lt;/span&gt;
dbt retry
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Selecting Specific Models
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;--select&lt;/code&gt; (or &lt;code&gt;-s&lt;/code&gt;) to run specific models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Single model&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--select&lt;/span&gt; stg_green_tripdata

&lt;span class="c"&gt;# Model and all upstream dependencies&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--select&lt;/span&gt; +fct_trips

&lt;span class="c"&gt;# Model and all downstream models&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--select&lt;/span&gt; stg_green_tripdata+

&lt;span class="c"&gt;# Both directions&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--select&lt;/span&gt; +fct_trips+

&lt;span class="c"&gt;# All models in a folder&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--select&lt;/span&gt; staging.&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="c"&gt;# Multiple models&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--select&lt;/span&gt; stg_green_tripdata stg_yellow_tripdata
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Target Environments
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Development (default)&lt;/span&gt;
dbt run

&lt;span class="c"&gt;# Production&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--target&lt;/span&gt; prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Materializations - Views vs Tables 📊
&lt;/h3&gt;

&lt;p&gt;Materialization controls how dbt persists your models in the warehouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Types of Materializations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What it Creates&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;view&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQL view (query stored, runs on access)&lt;/td&gt;
&lt;td&gt;Staging models, frequently changing logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;table&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Physical table (data stored)&lt;/td&gt;
&lt;td&gt;Final marts, large datasets, performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;incremental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Appends new data only&lt;/td&gt;
&lt;td&gt;Very large tables, event data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ephemeral&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not created (CTE in downstream)&lt;/td&gt;
&lt;td&gt;Helper models, intermediate steps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Setting Materializations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;In the model file:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_trips'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In dbt_project.yml (project-wide):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;my_project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;materialized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;view&lt;/span&gt;
    &lt;span class="na"&gt;marts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;materialized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;table&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  View vs Table Decision
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                 Should I use view or table?                  │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
              ┌──────────────────────────┐
              │ Is the query expensive?  │
              └──────────────────────────┘
                     │            │
                    Yes          No
                     │            │
                     ▼            ▼
               ┌─────────┐  ┌─────────┐
               │  TABLE  │  │  VIEW   │
               └─────────┘  └─────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use VIEW when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Staging models (simple transformations)&lt;/li&gt;
&lt;li&gt;Logic changes frequently&lt;/li&gt;
&lt;li&gt;Storage cost is a concern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use TABLE when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Final marts queried often&lt;/li&gt;
&lt;li&gt;Complex joins/aggregations&lt;/li&gt;
&lt;li&gt;Query performance matters&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Putting It All Together - The NYC Taxi Project 🚕
&lt;/h3&gt;

&lt;p&gt;In this module, we build a complete dbt project for NYC taxi data:&lt;/p&gt;

&lt;h3&gt;
  
  
  What We Build
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────┐
│                      RAW DATA                                 │
│  green_tripdata (GCS/BigQuery) │ yellow_tripdata (GCS/BigQuery)│
└───────────────────┬─────────────────────┬────────────────────┘
                    │                     │
                    ▼                     ▼
┌──────────────────────────────────────────────────────────────┐
│                    STAGING LAYER                              │
│      stg_green_tripdata    │    stg_yellow_tripdata          │
│      (cleaned, renamed)    │    (cleaned, renamed)           │
└───────────────────┬─────────────────────┬────────────────────┘
                    │                     │
                    └──────────┬──────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                  INTERMEDIATE LAYER                           │
│                   int_trips_unioned                           │
│            (green + yellow combined)                          │
└───────────────────────────────┬──────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────┐
│                      MARTS LAYER                              │
│  ┌─────────────┐  ┌───────────────┐  ┌─────────────────────┐ │
│  │ dim_zones   │  │   fct_trips   │  │fct_monthly_zone_rev │ │
│  │ (dimension) │  │    (fact)     │  │     (report)        │ │
│  └─────────────┘  └───────────────┘  └─────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Models We Create
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stg_green_tripdata&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Staging&lt;/td&gt;
&lt;td&gt;Cleaned green taxi data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stg_yellow_tripdata&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Staging&lt;/td&gt;
&lt;td&gt;Cleaned yellow taxi data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;int_trips_unioned&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Intermediate&lt;/td&gt;
&lt;td&gt;Combined yellow + green trips&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_zones&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dimension&lt;/td&gt;
&lt;td&gt;Zone lookup table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fct_trips&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fact&lt;/td&gt;
&lt;td&gt;One row per trip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fct_monthly_zone_revenue&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Report&lt;/td&gt;
&lt;td&gt;Monthly revenue by zone&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Setup Options 🔧
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Option 1: Local Setup (DuckDB + dbt Core)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Free, no cloud account needed&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Limited to your machine's power&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install dbt with DuckDB adapter&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-duckdb

&lt;span class="c"&gt;# 2. Clone the project&lt;/span&gt;
git clone https://github.com/DataTalksClub/data-engineering-zoomcamp
&lt;span class="nb"&gt;cd &lt;/span&gt;data-engineering-zoomcamp/04-analytics-engineering/taxi_rides_ny

&lt;span class="c"&gt;# 3. Create profiles.yml in ~/.dbt/&lt;/span&gt;
&lt;span class="c"&gt;# 4. Run dbt debug to test connection&lt;/span&gt;
dbt debug

&lt;span class="c"&gt;# 5. Build the project&lt;/span&gt;
dbt build &lt;span class="nt"&gt;--target&lt;/span&gt; prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 2: Cloud Setup (BigQuery + dbt Cloud)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Powerful, team collaboration, scheduler&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Requires GCP account (free tier available)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create dbt Cloud account (free)&lt;/li&gt;
&lt;li&gt;Connect to your BigQuery project&lt;/li&gt;
&lt;li&gt;Clone the repo in dbt Cloud IDE&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;dbt build --target prod&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Troubleshooting Common Issues 🔍
&lt;/h3&gt;

&lt;h3&gt;
  
  
  "Profile not found"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Check &lt;code&gt;dbt_project.yml&lt;/code&gt; profile name matches &lt;code&gt;profiles.yml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ensure &lt;code&gt;profiles.yml&lt;/code&gt; is in &lt;code&gt;~/.dbt/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  "Source not found"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Verify database/schema names in &lt;code&gt;sources.yml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Check that your data is actually loaded into the warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  "Model depends on model that was not found"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Check for typos in &lt;code&gt;ref()&lt;/code&gt; calls&lt;/li&gt;
&lt;li&gt;Ensure referenced model exists&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  DuckDB Out of Memory
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Add memory settings to &lt;code&gt;profiles.yml&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;memory_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2GB'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Key Takeaways 🎓
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analytics Engineering&lt;/strong&gt; bridges data engineering and data analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;dbt&lt;/strong&gt; brings software engineering best practices to SQL transformations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dimensional modeling&lt;/strong&gt; organizes data into facts (events) and dimensions (attributes)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Three layers&lt;/strong&gt; - staging (raw copy), intermediate (transformations), marts (final)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;ref()&lt;/code&gt; and &lt;code&gt;source()&lt;/code&gt;&lt;/strong&gt; are your main functions for building dependencies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt; ensures data quality - use unique, not_null, accepted_values, relationships&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Documentation&lt;/strong&gt; is auto-generated from YAML descriptions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;dbt build&lt;/code&gt;&lt;/strong&gt; runs and tests everything in dependency order&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Additional Resources 📚
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getdbt.com/" rel="noopener noreferrer"&gt;dbt Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.getdbt.com/courses/dbt-fundamentals" rel="noopener noreferrer"&gt;dbt Fundamentals Course&lt;/a&gt; (free)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/04-analytics-engineering/refreshers/SQL.md" rel="noopener noreferrer"&gt;SQL Refresher for Window Functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://community.getdbt.com/" rel="noopener noreferrer"&gt;dbt Community Slack&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>analytics</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Part 2: dbt Project Structure &amp; Building Models 📁</title>
      <dc:creator>Abdelrahman Adnan</dc:creator>
      <pubDate>Mon, 16 Feb 2026 22:45:11 +0000</pubDate>
      <link>https://dev.to/abdelrahman_adnan/part-2-dbt-project-structure-building-models-5g3g</link>
      <guid>https://dev.to/abdelrahman_adnan/part-2-dbt-project-structure-building-models-5g3g</guid>
      <description>&lt;h1&gt;
  
  
  DataEngineeringZoomcamp #dbt #AnalyticsEngineering #DataModeling
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Why Model Data? 📐
&lt;/h3&gt;

&lt;p&gt;Raw data is messy and hard to query. Dimensional modeling organizes data into a structure that's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy to understand&lt;/li&gt;
&lt;li&gt;Fast to query&lt;/li&gt;
&lt;li&gt;Flexible for different analyses&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fact Tables vs Dimension Tables
&lt;/h3&gt;

&lt;p&gt;This is the core of dimensional modeling (also called "star schema"):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fact Tables (&lt;code&gt;fct_&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contain &lt;strong&gt;measurements&lt;/strong&gt; or &lt;strong&gt;events&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;One row per thing that happened&lt;/li&gt;
&lt;li&gt;Usually have many rows (millions/billions)&lt;/li&gt;
&lt;li&gt;Contain numeric values you want to analyze&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;fct_trips&lt;/code&gt; - one row per taxi trip&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fct_sales&lt;/code&gt; - one row per sale&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fct_orders&lt;/code&gt; - one row per order
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Example fact table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fct_trips&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;trip_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;-- unique identifier&lt;/span&gt;
    &lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;-- when it happened&lt;/span&gt;
    &lt;span class="n"&gt;dropoff_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pickup_zone_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;-- foreign keys to dimensions&lt;/span&gt;
    &lt;span class="n"&gt;dropoff_zone_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;-- numeric measures&lt;/span&gt;
    &lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;transformed_trips&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dimension Tables (&lt;code&gt;dim_&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contain &lt;strong&gt;attributes&lt;/strong&gt; or &lt;strong&gt;descriptive information&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;One row per entity&lt;/li&gt;
&lt;li&gt;Usually fewer rows&lt;/li&gt;
&lt;li&gt;Provide context for fact tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dim_zones&lt;/code&gt; - one row per taxi zone&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_customers&lt;/code&gt; - one row per customer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_products&lt;/code&gt; - one row per product
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Example dimension table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_zones&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;location_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;-- primary key&lt;/span&gt;
    &lt;span class="n"&gt;borough&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;-- descriptive attributes&lt;/span&gt;
    &lt;span class="n"&gt;zone_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;service_zone&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;zone_lookup&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Star Schema ⭐
&lt;/h3&gt;

&lt;p&gt;When you join facts and dimensions, you get a star shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ┌──────────────┐
                    │  dim_zones   │
                    │  (pickup)    │
                    └───────┬──────┘
                            │
┌──────────────┐    ┌───────┴──────┐    ┌──────────────┐
│  dim_vendors │────│  fct_trips   │────│  dim_zones   │
│              │    │  (center)    │    │  (dropoff)   │
└──────────────┘    └───────┬──────┘    └──────────────┘
                            │
                    ┌───────┴──────┐
                    │ dim_payment  │
                    │    types     │
                    └──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it's powerful:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Easy to answer business questions!&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;borough&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;trip_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fct_trips&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_zones&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pickup_zone_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;borough&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  dbt Project Structure
&lt;/h3&gt;

&lt;p&gt;A dbt project has a specific folder structure. Understanding this helps you navigate any project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;taxi_rides_ny/
├── dbt_project.yml      # Project configuration (most important!)
├── profiles.yml         # Database connection (often in ~/.dbt/)
├── packages.yml         # External packages to install
│
├── models/              # ⭐ YOUR SQL MODELS LIVE HERE
│   ├── staging/         # Raw data, minimally cleaned
│   ├── intermediate/    # Complex transformations
│   └── marts/           # Final, business-ready tables
│
├── seeds/               # CSV files to load as tables
├── macros/              # Reusable SQL functions
├── tests/               # Custom test files
├── snapshots/           # Track data changes over time
└── analyses/            # Ad-hoc queries (not built)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The &lt;code&gt;dbt_project.yml&lt;/code&gt; File
&lt;/h3&gt;

&lt;p&gt;This is the &lt;strong&gt;most important file&lt;/strong&gt; - dbt looks for it first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;taxi_rides_ny'&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1.0.0'&lt;/span&gt;
&lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;taxi_rides_ny'&lt;/span&gt;  &lt;span class="c1"&gt;# Must match profiles.yml!&lt;/span&gt;

&lt;span class="c1"&gt;# Default configurations&lt;/span&gt;
&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;taxi_rides_ny&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;materialized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;view&lt;/span&gt;  &lt;span class="c1"&gt;# Staging models become views&lt;/span&gt;
    &lt;span class="na"&gt;marts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;materialized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;table&lt;/span&gt; &lt;span class="c1"&gt;# Mart models become tables&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Three Model Layers
&lt;/h3&gt;

&lt;p&gt;dbt recommends organizing models into three layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Staging Layer (&lt;code&gt;staging/&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; Clean copy of raw data with minimal transformations&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens here:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rename columns (snake_case, clear names)&lt;/li&gt;
&lt;li&gt;Cast data types&lt;/li&gt;
&lt;li&gt;Filter obviously bad data&lt;/li&gt;
&lt;li&gt;Keep 1:1 with source (same rows, similar columns)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- models/staging/stg_green_tripdata.sql&lt;/span&gt;
&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tripdata&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'staging'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'green_tripdata'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;vendorid&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;  &lt;span class="c1"&gt;-- filter bad data&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="c1"&gt;-- Rename and cast columns&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendorid&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lpep_pickup_datetime&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lpep_dropoff_datetime&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dropoff_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pulocationid&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pickup_location_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dolocationid&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dropoff_location_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trip_distance&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;trip_distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tripdata&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Intermediate Layer (&lt;code&gt;intermediate/&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; Complex transformations, joins, business logic&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens here:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combine multiple staging models&lt;/li&gt;
&lt;li&gt;Apply business rules&lt;/li&gt;
&lt;li&gt;Heavy data manipulation&lt;/li&gt;
&lt;li&gt;NOT exposed to end users
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- models/intermediate/int_trips_unioned.sql&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;green_trips&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Green'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;service_type&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_green_tripdata'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;yellow_trips&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Yellow'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;service_type&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_yellow_tripdata'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;green_trips&lt;/span&gt;
&lt;span class="k"&gt;union&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;yellow_trips&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Marts Layer (&lt;code&gt;marts/&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; Final, business-ready tables for end users&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens here:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Final fact and dimension tables&lt;/li&gt;
&lt;li&gt;Ready for dashboards and reports&lt;/li&gt;
&lt;li&gt;Only these should be exposed to BI tools!
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- models/marts/fct_trips.sql&lt;/span&gt;
&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trip_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropoff_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pickup_location_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropoff_location_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;z_pickup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;zone&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pickup_zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;z_dropoff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;zone&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dropoff_zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trip_distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'int_trips_unioned'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'dim_zones'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="n"&gt;z_pickup&lt;/span&gt; 
    &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pickup_location_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;z_pickup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location_id&lt;/span&gt;
&lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'dim_zones'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="n"&gt;z_dropoff&lt;/span&gt; 
    &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropoff_location_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;z_dropoff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Sources and the &lt;code&gt;source()&lt;/code&gt; Function 📥
&lt;/h3&gt;

&lt;h3&gt;
  
  
  What are Sources?
&lt;/h3&gt;

&lt;p&gt;Sources tell dbt where your raw data lives in the warehouse. They're defined in YAML files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/staging/sources.yml&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;           &lt;span class="c1"&gt;# Logical name (you choose)&lt;/span&gt;
    &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my_project&lt;/span&gt;    &lt;span class="c1"&gt;# Your GCP project or database&lt;/span&gt;
    &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nytaxi&lt;/span&gt;          &lt;span class="c1"&gt;# BigQuery dataset or schema&lt;/span&gt;
    &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;green_tripdata&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yellow_tripdata&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using the &lt;code&gt;source()&lt;/code&gt; Function
&lt;/h3&gt;

&lt;p&gt;Instead of hardcoding table names, use &lt;code&gt;source()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- ❌ Bad - hardcoded path&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nytaxi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;green_tripdata&lt;/span&gt;

&lt;span class="c1"&gt;-- ✅ Good - using source()&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'staging'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'green_tripdata'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change database/schema in one place (YAML file)&lt;/li&gt;
&lt;li&gt;dbt tracks dependencies automatically&lt;/li&gt;
&lt;li&gt;Can add freshness tests on sources&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  The &lt;code&gt;ref()&lt;/code&gt; Function - Building Dependencies 🔗
&lt;/h3&gt;

&lt;p&gt;This is &lt;strong&gt;the most important dbt function!&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;source()&lt;/code&gt; vs &lt;code&gt;ref()&lt;/code&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Use When&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;source()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reading raw/external data&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{{ source('staging', 'green_tripdata') }}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ref()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reading another dbt model&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{{ ref('stg_green_tripdata') }}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How &lt;code&gt;ref()&lt;/code&gt; Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- models/marts/fct_trips.sql&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'int_trips_unioned'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  &lt;span class="c1"&gt;-- References the int_trips_unioned model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What &lt;code&gt;ref()&lt;/code&gt; does:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;✅ Resolves to the correct schema/table name&lt;/li&gt;
&lt;li&gt;✅ Builds the dependency graph automatically&lt;/li&gt;
&lt;li&gt;✅ Ensures models run in the correct order&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The DAG (Directed Acyclic Graph)
&lt;/h3&gt;

&lt;p&gt;dbt builds a &lt;strong&gt;dependency graph&lt;/strong&gt; from your &lt;code&gt;ref()&lt;/code&gt; calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────┐     ┌──────────────────┐
│ stg_green_trips  │     │ stg_yellow_trips │
└────────┬─────────┘     └────────┬─────────┘
         │                        │
         └──────────┬─────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ int_trips_unioned│
         └────────┬─────────┘
                  │
                  ▼
         ┌──────────────────┐
         │    fct_trips     │
         └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run &lt;code&gt;dbt build&lt;/code&gt;, models run in dependency order automatically!&lt;/p&gt;




&lt;h3&gt;
  
  
  Seeds - Loading CSV Files 🌱
&lt;/h3&gt;

&lt;p&gt;Seeds let you load small CSV files into your warehouse as tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use Seeds
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Good use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lookup tables (zone names, country codes)&lt;/li&gt;
&lt;li&gt;Static mappings (vendor ID → vendor name)&lt;/li&gt;
&lt;li&gt;Small reference data that rarely changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;Not good for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large datasets (use proper data loading)&lt;/li&gt;
&lt;li&gt;Frequently changing data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to Use Seeds
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Put CSV files in the &lt;code&gt;seeds/&lt;/code&gt; folder:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seeds/
└── taxi_zone_lookup.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;locationid,borough,zone,service_zone
1,EWR,Newark Airport,EWR
2,Queens,Jamaica Bay,Boro Zone
3,Bronx,Allerton/Pelham Gardens,Boro Zone
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Run &lt;code&gt;dbt seed&lt;/code&gt;:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt seed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reference in models using &lt;code&gt;ref()&lt;/code&gt;:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- models/marts/dim_zones.sql&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="n"&gt;locationid&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;location_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;borough&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;service_zone&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'taxi_zone_lookup'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Module 4 Summary - Analytics Engineering with dbt</title>
      <dc:creator>Abdelrahman Adnan</dc:creator>
      <pubDate>Mon, 16 Feb 2026 22:44:26 +0000</pubDate>
      <link>https://dev.to/abdelrahman_adnan/-module-4-summary-analytics-engineering-with-dbt-1p8g</link>
      <guid>https://dev.to/abdelrahman_adnan/-module-4-summary-analytics-engineering-with-dbt-1p8g</guid>
      <description>&lt;h1&gt;
  
  
  #DataEngineeringZoomcamp #dbt #AnalyticsEngineering #DataModeling
&lt;/h1&gt;




&lt;h2&gt;
  
  
  Part 1: Introduction to Analytics Engineering &amp;amp; dbt Fundamentals 🎯
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Analytics Engineering?
&lt;/h3&gt;

&lt;h3&gt;
  
  
  The Evolution of Data Roles
&lt;/h3&gt;

&lt;p&gt;Traditionally, there were two main roles in data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;th&gt;Skills&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Engineer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Building pipelines, infrastructure, data movement&lt;/td&gt;
&lt;td&gt;Python, Spark, Airflow, cloud services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Analyst&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Creating reports, dashboards, insights&lt;/td&gt;
&lt;td&gt;SQL, Excel, BI tools&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But there was a gap! Who transforms the raw data into clean, analysis-ready tables? Enter the &lt;strong&gt;Analytics Engineer&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Does an Analytics Engineer Do?
&lt;/h3&gt;

&lt;p&gt;An Analytics Engineer sits between Data Engineering and Data Analytics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│  Data Engineer  │ ──► │  Analytics Engineer  │ ──► │   Data Analyst  │
│                 │     │                      │     │                 │
│  • Pipelines    │     │  • Transform data    │     │  • Dashboards   │
│  • Infrastructure│    │  • Data modeling     │     │  • Reports      │
│  • Data movement│     │  • Quality tests     │     │  • Insights     │
└─────────────────┘     │  • Documentation     │     └─────────────────┘
                        └──────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key responsibilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📊 Transform raw data into clean, modeled datasets&lt;/li&gt;
&lt;li&gt;🧪 Write tests to ensure data quality&lt;/li&gt;
&lt;li&gt;📝 Document everything so others can understand&lt;/li&gt;
&lt;li&gt;🔗 Build the "T" in ELT (Extract, Load, Transform)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Kitchen Analogy 🍳
&lt;/h3&gt;

&lt;p&gt;Think of a data warehouse like a restaurant:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Restaurant&lt;/th&gt;
&lt;th&gt;Data Warehouse&lt;/th&gt;
&lt;th&gt;Who accesses it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Pantry&lt;/strong&gt; (raw ingredients)&lt;/td&gt;
&lt;td&gt;Staging area (raw data)&lt;/td&gt;
&lt;td&gt;Data Engineers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Kitchen&lt;/strong&gt; (cooking happens)&lt;/td&gt;
&lt;td&gt;Processing area (transformations)&lt;/td&gt;
&lt;td&gt;Analytics Engineers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Dining Hall&lt;/strong&gt; (served dishes)&lt;/td&gt;
&lt;td&gt;Presentation area (final tables)&lt;/td&gt;
&lt;td&gt;Business users, Analysts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Raw ingredients (data) come in, get processed (transformed), and are served as polished dishes (analytics-ready tables).&lt;/p&gt;




&lt;h2&gt;
  
  
  What is dbt? 🛠️
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;dbt&lt;/strong&gt; stands for &lt;strong&gt;data build tool&lt;/strong&gt;. It's the most popular tool for analytics engineering.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problems dbt Solves
&lt;/h3&gt;

&lt;p&gt;Before dbt, data transformation was messy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ SQL scripts scattered everywhere with no organization&lt;/li&gt;
&lt;li&gt;❌ No version control (changes got lost)&lt;/li&gt;
&lt;li&gt;❌ No testing (errors discovered too late)&lt;/li&gt;
&lt;li&gt;❌ No documentation (nobody knew what anything meant)&lt;/li&gt;
&lt;li&gt;❌ No environments (changes went straight to production!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;dbt brings software engineering best practices to analytics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Version control&lt;/strong&gt; - Your SQL lives in Git&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Modularity&lt;/strong&gt; - Reusable pieces instead of copy-paste&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Testing&lt;/strong&gt; - Automated data quality checks&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Documentation&lt;/strong&gt; - Generated from your code&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Environments&lt;/strong&gt; - Separate dev and prod&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How dbt Works
&lt;/h3&gt;

&lt;p&gt;dbt follows a simple principle: &lt;strong&gt;write SQL, dbt handles the rest&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                     Your dbt Project                        │
│                                                             │
│   ┌───────────────┐    ┌───────────────┐    ┌────────────┐ │
│   │  models/*.sql │───►│   dbt compile │───►│ SQL Queries│ │
│   │  (your logic) │    │   dbt run     │    │ (executed) │ │
│   └───────────────┘    └───────────────┘    └────────────┘ │
│                              │                              │
│                              ▼                              │
│                    ┌──────────────────┐                     │
│                    │  Data Warehouse  │                     │
│                    │  (views/tables)  │                     │
│                    └──────────────────┘                     │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;You write SQL files (called "models")&lt;/li&gt;
&lt;li&gt;dbt compiles them (adds warehouse-specific syntax)&lt;/li&gt;
&lt;li&gt;dbt runs them against your data warehouse&lt;/li&gt;
&lt;li&gt;Views/tables are created automatically!&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  dbt Core vs dbt Cloud
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;dbt Core&lt;/th&gt;
&lt;th&gt;dbt Cloud&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (open source)&lt;/td&gt;
&lt;td&gt;Free tier + paid plans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Where it runs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your machine/server&lt;/td&gt;
&lt;td&gt;Cloud-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual installation&lt;/td&gt;
&lt;td&gt;Browser-based IDE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scheduling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Need external tool&lt;/td&gt;
&lt;td&gt;Built-in scheduler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local development, cost savings&lt;/td&gt;
&lt;td&gt;Teams, ease of use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;💡 &lt;strong&gt;For this course:&lt;/strong&gt; You can use either! Local setup uses DuckDB + dbt Core (free). Cloud setup uses BigQuery + dbt Cloud.&lt;/p&gt;




</description>
      <category>analytics</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Part 3: Partitioning &amp; Clustering for Performance 🚀</title>
      <dc:creator>Abdelrahman Adnan</dc:creator>
      <pubDate>Mon, 09 Feb 2026 23:09:31 +0000</pubDate>
      <link>https://dev.to/abdelrahman_adnan/part-3-partitioning-clustering-for-performance-37if</link>
      <guid>https://dev.to/abdelrahman_adnan/part-3-partitioning-clustering-for-performance-37if</guid>
      <description>&lt;p&gt;This is where BigQuery optimization gets really powerful! These two techniques can reduce your query costs by 90% or more.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem We're Solving 🤔
&lt;/h3&gt;

&lt;p&gt;Imagine you have a table with 5 years of taxi trip data - about 500 million rows. Every time you query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;taxi_trips&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;pickup_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without optimization, BigQuery scans ALL 500 million rows just to find trips from one day. That's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow (lots of data to read)&lt;/li&gt;
&lt;li&gt;Expensive (you pay for all data scanned)&lt;/li&gt;
&lt;li&gt;Wasteful (you only needed 0.05% of the data!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Partitioning and clustering solve this problem!&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Partitioning: Dividing Your Table into Sections 📁
&lt;/h3&gt;

&lt;p&gt;Think of partitioning like organizing a filing cabinet. Instead of one giant drawer with all documents, you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drawer for January&lt;/li&gt;
&lt;li&gt;Drawer for February&lt;/li&gt;
&lt;li&gt;Drawer for March&lt;/li&gt;
&lt;li&gt;...and so on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you need something from March, you ONLY open the March drawer!&lt;/p&gt;

&lt;h4&gt;
  
  
  How Partitioning Works in BigQuery
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a table partitioned by date&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi_partitioned`&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi_external`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now your table looks like this internally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;taxi_partitioned/
├── 2024-01-01/    (all trips from Jan 1)
├── 2024-01-02/    (all trips from Jan 2)
├── 2024-01-03/    (all trips from Jan 3)
│   ... 
├── 2024-06-30/    (all trips from Jun 30)
└── [metadata]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you query with a date filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;taxi_partitioned&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-03-15'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;BigQuery ONLY reads the 2024-03-15 partition! The other 180+ partitions are never touched.&lt;/p&gt;

&lt;h4&gt;
  
  
  Types of Partitioning
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Time-based partitioning (most common)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Partition by day (default)&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;-- Partition by month (for less granular data)&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;-- Partition by year&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;YEAR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Integer range partitioning&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Partition by customer ID ranges&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;RANGE_BUCKET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GENERATE_ARRAY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Ingestion time partitioning&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Partition by when data was loaded&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;_PARTITIONDATE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Partitioning Rules to Remember ⚠️
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max partitions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4,000 per table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Min partition size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Aim for at least 1GB per partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;One column only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Can only partition on ONE column&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Column types&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DATE, TIMESTAMP, DATETIME, or INTEGER&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;💡 &lt;strong&gt;When NOT to use partitioning:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you'd have &amp;lt; 1GB per partition (use clustering instead)&lt;/li&gt;
&lt;li&gt;If you'd exceed 4,000 partitions&lt;/li&gt;
&lt;li&gt;If you rarely filter on the partition column&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Clustering: Organizing Data Within Partitions 🗂️
&lt;/h3&gt;

&lt;p&gt;If partitioning is like having separate drawers in a filing cabinet, clustering is like organizing the folders WITHIN each drawer alphabetically.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Clustering Works
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create table with partitioning AND clustering&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi_optimized`&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment_type&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi_external`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now within each date partition, data is sorted by vendor_id, then by payment_type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;taxi_optimized/
├── 2024-01-15/
│   ├── vendor_id=1, payment_type=1, ...
│   ├── vendor_id=1, payment_type=2, ...
│   ├── vendor_id=2, payment_type=1, ...
│   └── vendor_id=2, payment_type=2, ...
├── 2024-01-16/
│   └── (similarly organized)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;taxi_optimized&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;BigQuery:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Goes directly to the 2024-01-15 partition (thanks to partitioning)&lt;/li&gt;
&lt;li&gt;Reads only the vendor_id=1 blocks (thanks to clustering)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Even more data skipped = even faster and cheaper!&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Clustering Rules 📏
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max columns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 4 clustering columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Order matters&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Put most filtered column first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No cost for re-clustering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BigQuery automatically re-clusters as data is added&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Works with partitioning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best used together!&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Minimum table size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Most effective for tables &amp;gt; 1GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Good clustering column candidates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Columns you frequently filter on (WHERE clause)&lt;/li&gt;
&lt;li&gt;Columns you frequently group by (GROUP BY clause)&lt;/li&gt;
&lt;li&gt;High-cardinality columns (many distinct values)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Partitioning vs Clustering: When to Use What? 🤷
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Always filter by date&lt;/td&gt;
&lt;td&gt;Partition by date&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filter by date AND other columns&lt;/td&gt;
&lt;td&gt;Partition by date, cluster by other columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filter by multiple non-date columns&lt;/td&gt;
&lt;td&gt;Cluster by those columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need to know query cost upfront&lt;/td&gt;
&lt;td&gt;Must use partitioning (clustering doesn't show estimates)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Less than 1GB per potential partition&lt;/td&gt;
&lt;td&gt;Use clustering instead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Would have &amp;gt; 4,000 partitions&lt;/td&gt;
&lt;td&gt;Use clustering instead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data is rarely filtered&lt;/td&gt;
&lt;td&gt;Maybe neither - analyze your query patterns first&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Real-World Performance Comparison 📊
&lt;/h3&gt;

&lt;p&gt;I ran tests on the NYC taxi dataset (about 20 million rows). Here are the results:&lt;/p&gt;

&lt;h4&gt;
  
  
  Test 1: Filtering by Date Range
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2024-03-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2024-03-15'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Table Type&lt;/th&gt;
&lt;th&gt;Data Scanned&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Non-partitioned&lt;/td&gt;
&lt;td&gt;310 MB&lt;/td&gt;
&lt;td&gt;$0.00155&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partitioned by date&lt;/td&gt;
&lt;td&gt;27 MB&lt;/td&gt;
&lt;td&gt;$0.000135&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Savings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91% less!&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91% cheaper!&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Test 2: Filtering by Date AND Vendor
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2024-06-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2024-06-30'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Table Type&lt;/th&gt;
&lt;th&gt;Data Scanned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Partitioned only&lt;/td&gt;
&lt;td&gt;1.1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partitioned + Clustered&lt;/td&gt;
&lt;td&gt;865 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Additional savings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;21% less!&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Combined savings: Over 90% reduction in costs!&lt;/strong&gt; 💰&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-by-Step: Creating an Optimized Table
&lt;/h3&gt;

&lt;p&gt;Here's the full workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Step 1: Create external table pointing to your data in GCS&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`my-project.dataset.taxi_external`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'PARQUET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;uris&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'gs://my-bucket/yellow_taxi_2024/*.parquet'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Step 2: Check how many records we have&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`my-project.dataset.taxi_external`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Result: 20,332,093 records&lt;/span&gt;

&lt;span class="c1"&gt;-- Step 3: Create optimized table with partitioning and clustering&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`my-project.dataset.taxi_optimized`&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tpep_dropoff_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;VendorID&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`my-project.dataset.taxi_external`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Step 4: Verify partitions were created&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;partition_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;total_rows&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`dataset.INFORMATION_SCHEMA.PARTITIONS`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'taxi_optimized'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;partition_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Best Practices Cheat Sheet ✅
&lt;/h3&gt;

&lt;h4&gt;
  
  
  For Reducing Costs 💵
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;❌ Never use &lt;code&gt;SELECT *&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;✅ Only query columns you need&lt;/li&gt;
&lt;li&gt;✅ Use partitioned tables&lt;/li&gt;
&lt;li&gt;✅ Add clustering for frequently filtered columns&lt;/li&gt;
&lt;li&gt;✅ Check estimated bytes before running&lt;/li&gt;
&lt;li&gt;✅ Use table previews in the console instead of running &lt;code&gt;SELECT *&lt;/code&gt; for quick looks&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  For Better Performance ⚡
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;✅ Filter early - apply WHERE before JOINs&lt;/li&gt;
&lt;li&gt;✅ Put largest table first in JOINs&lt;/li&gt;
&lt;li&gt;✅ Use ORDER BY at the end of query&lt;/li&gt;
&lt;li&gt;✅ Consider approximate functions (APPROX_COUNT_DISTINCT) when exact precision isn't needed&lt;/li&gt;
&lt;li&gt;✅ Avoid JavaScript UDFs when possible&lt;/li&gt;
&lt;li&gt;✅ Don't over-partition (keep partitions &amp;gt; 1GB)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Quick Reference: Common SQL Patterns for Beginners 📝
&lt;/h2&gt;

&lt;p&gt;Here are the most useful BigQuery SQL commands you'll need:&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Count all records in a table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.table`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Count distinct values in a column&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Get first 10 rows (but remember - this still scans the whole table!)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi`&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Better way to preview - use table preview in BigQuery console instead!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Filtering Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Filter by exact value&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Filter by date range&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-31'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Filter with multiple conditions&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;fare_amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-03-15'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Aggregations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Sum, average, min, max&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_fares&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_fare&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_fare&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_fare&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;trip_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Group by&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;trips&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_fare&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi`&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating Tables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create external table from GCS&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi_external`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'PARQUET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;uris&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'gs://bucket-name/folder/*.parquet'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Create native table from external&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi_native`&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi_external`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create partitioned + clustered table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi_optimized`&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project.dataset.taxi_external`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Checking Table Info
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- View partition information&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;partition_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;total_rows&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`dataset.INFORMATION_SCHEMA.PARTITIONS`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'your_table_name'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;partition_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Check table schema&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`dataset.INFORMATION_SCHEMA.COLUMNS`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'your_table_name'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Glossary for Beginners 📚
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Simple Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Warehouse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A big database designed for analyzing historical data, not for running apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OLTP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Databases for running applications (fast, small transactions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OLAP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Databases for analysis (complex queries, lots of data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google's cloud data warehouse service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GCS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Cloud Storage - where you store files in the cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;External Table&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A table that reads data from GCS without copying it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Native Table&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A table with data stored in BigQuery itself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Partitioning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Splitting a table into smaller pieces by date or number&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Clustering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sorting data within partitions by specific columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Columnar Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Storing data by column instead of row (faster for analytics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A unit of compute power in BigQuery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Scanned&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How much data BigQuery reads to answer your query (you pay for this!)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Common Mistakes to Avoid ⚠️
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Using &lt;code&gt;SELECT *&lt;/code&gt; everywhere&lt;/strong&gt; - Always specify columns you need!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Thinking LIMIT reduces cost&lt;/strong&gt; - It doesn't! BigQuery scans first, limits after.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not using partitions&lt;/strong&gt; - Always partition time-series data by date.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wrong partition column&lt;/strong&gt; - Partition by columns you ALWAYS filter on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Too many partitions&lt;/strong&gt; - Keep it under 4,000, aim for &amp;gt;1GB per partition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ignoring the query validator&lt;/strong&gt; - Always check estimated bytes before running!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not using clustering with partitioning&lt;/strong&gt; - They work best together!&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Resources for Learning More 📖
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📊 &lt;a href="https://cloud.google.com/bigquery/docs" rel="noopener noreferrer"&gt;BigQuery Official Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🎥 &lt;a href="https://youtu.be/jrHljAoD6nM" rel="noopener noreferrer"&gt;DE Zoomcamp Video: Data Warehouse and BigQuery&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🎥 &lt;a href="https://youtu.be/-CqXf7vhhDs" rel="noopener noreferrer"&gt;DE Zoomcamp Video: Partitioning vs Clustering&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🎥 &lt;a href="https://youtu.be/k81mLJVX08w" rel="noopener noreferrer"&gt;DE Zoomcamp Video: Best Practices&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🎥 &lt;a href="https://youtu.be/eduHi1inM4s" rel="noopener noreferrer"&gt;DE Zoomcamp Video: Internals of BigQuery&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📝 &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/03-data-warehouse/big_query.sql" rel="noopener noreferrer"&gt;Course SQL Examples&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📑 &lt;a href="https://docs.google.com/presentation/d/1a3ZoBAXFk8-EhUsd7rAZd-5p_HpltkzSeujjRGB2TAI/edit" rel="noopener noreferrer"&gt;Course Slides&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary: Key Takeaways 🎯
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data warehouses&lt;/strong&gt; are for analysis, not running apps - that's why they exist!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;BigQuery&lt;/strong&gt; is serverless - no servers to manage, just write SQL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Columnar storage&lt;/strong&gt; = only reads columns you request = faster + cheaper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;External tables&lt;/strong&gt; = data in GCS, slower but flexible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native tables&lt;/strong&gt; = data in BigQuery, faster but costs more storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partitioning&lt;/strong&gt; = split table by date, only scan relevant dates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clustering&lt;/strong&gt; = sort data within partitions, skip irrelevant blocks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always check estimated bytes&lt;/strong&gt; before running queries!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Never use &lt;code&gt;SELECT *&lt;/code&gt;&lt;/strong&gt; - specify only the columns you need.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Combine partitioning + clustering&lt;/strong&gt; for maximum optimization!&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h1&gt;
  
  
  DataEngineeringZoomcamp #BigQuery #DataWarehouse #GCP #SQL #CloudComputing
&lt;/h1&gt;

</description>
      <category>dataengineering</category>
      <category>googlecloud</category>
      <category>performance</category>
      <category>sql</category>
    </item>
    <item>
      <title>Part 2: BigQuery Deep Dive 🔍</title>
      <dc:creator>Abdelrahman Adnan</dc:creator>
      <pubDate>Mon, 09 Feb 2026 23:08:43 +0000</pubDate>
      <link>https://dev.to/abdelrahman_adnan/part-2-bigquery-deep-dive-3m29</link>
      <guid>https://dev.to/abdelrahman_adnan/part-2-bigquery-deep-dive-3m29</guid>
      <description>&lt;h3&gt;
  
  
  What is BigQuery?
&lt;/h3&gt;

&lt;p&gt;BigQuery is Google's &lt;strong&gt;data warehouse in the cloud&lt;/strong&gt;. It's one of the most popular choices for storing and analyzing large amounts of data because it's:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Serverless&lt;/strong&gt; - You don't manage any servers. No installing software, no worrying about disk space, no maintenance. Google handles everything.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fully managed&lt;/strong&gt; - Google takes care of security, backups, scaling, and updates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Petabyte-scale&lt;/strong&gt; - Can handle absolutely massive datasets (1 petabyte = 1,000 terabytes = 1,000,000 gigabytes!)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SQL-based&lt;/strong&gt; - You just write SQL queries. No need to learn a new programming language!&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why BigQuery is Great for Beginners 🌟
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;☁️ &lt;strong&gt;No setup headaches&lt;/strong&gt; - Create a project, load data, start querying. That's it!&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;Free tier&lt;/strong&gt; - 1TB of queries and 10GB storage free per month&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Familiar SQL&lt;/strong&gt; - If you know basic SQL, you can use BigQuery&lt;/li&gt;
&lt;li&gt;🔗 &lt;strong&gt;Works with everything&lt;/strong&gt; - Google Sheets, Data Studio, Python, R, etc.&lt;/li&gt;
&lt;li&gt;🤖 &lt;strong&gt;Built-in ML&lt;/strong&gt; - Train machine learning models using just SQL!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How BigQuery Works Under the Hood 🔧
&lt;/h3&gt;

&lt;p&gt;Understanding the architecture helps you write better queries and save money. Don't worry, I'll keep it simple!&lt;/p&gt;

&lt;h4&gt;
  
  
  The Secret: Separation of Storage and Compute
&lt;/h4&gt;

&lt;p&gt;Traditional databases store data and process queries on the same machine. BigQuery does something clever - it separates them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│                    YOUR SQL QUERY                        │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                DREMEL (Compute Engine)                   │
│                                                         │
│   Your query gets broken into tiny pieces and           │
│   thousands of workers process them in parallel         │
└─────────────────────────┬───────────────────────────────┘
                          │
                          │  Jupiter Network (super fast!)
                          │  1 Terabyte per second
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                 COLOSSUS (Storage)                       │
│                                                         │
│   Your data lives here in COLUMNAR format               │
│   (organized by columns, not rows)                      │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What Does "Columnar Storage" Mean? 📋
&lt;/h4&gt;

&lt;p&gt;This is SUPER important for understanding BigQuery performance!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional databases (row-oriented):&lt;/strong&gt;&lt;br&gt;
Stores data like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Row 1: [John, 25, New York, $50000]
Row 2: [Jane, 30, Chicago, $60000]
Row 3: [Bob, 35, Miami, $55000]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To find all salaries, it reads EVERY row, even though you only need one column.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BigQuery (column-oriented):&lt;/strong&gt;&lt;br&gt;
Stores data like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Names column:    [John, Jane, Bob]
Ages column:     [25, 30, 35]
Cities column:   [New York, Chicago, Miami]
Salaries column: [$50000, $60000, $55000]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To find all salaries, it ONLY reads the salary column! Much faster and cheaper!&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;This is why &lt;code&gt;SELECT *&lt;/code&gt; is expensive in BigQuery&lt;/strong&gt; - it has to read EVERY column. Always specify only the columns you need!&lt;/p&gt;

&lt;h4&gt;
  
  
  The Dremel Execution Engine 🚀
&lt;/h4&gt;

&lt;p&gt;When you run a query, here's what happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Root Server&lt;/strong&gt; receives your query&lt;/li&gt;
&lt;li&gt;Query is broken into smaller pieces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixers&lt;/strong&gt; distribute work to thousands of &lt;strong&gt;Leaf Nodes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Each Leaf Node processes a small chunk of data in parallel&lt;/li&gt;
&lt;li&gt;Results flow back up through Mixers to Root&lt;/li&gt;
&lt;li&gt;You get your final result!
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ┌──────────┐
                    │   ROOT   │  ← Your query comes here
                    └────┬─────┘
                         │
           ┌─────────────┼─────────────┐
           ▼             ▼             ▼
      ┌────────┐    ┌────────┐    ┌────────┐
      │ MIXER  │    │ MIXER  │    │ MIXER  │
      └───┬────┘    └───┬────┘    └───┬────┘
          │             │             │
    ┌─────┼─────┐ ┌─────┼─────┐ ┌─────┼─────┐
    ▼     ▼     ▼ ▼     ▼     ▼ ▼     ▼     ▼
   [L]   [L]   [L][L]   [L]   [L][L]   [L]   [L]

   L = Leaf nodes (thousands of them!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; A query that would take hours on your laptop can run in seconds because thousands of machines work on it simultaneously!&lt;/p&gt;

&lt;h3&gt;
  
  
  External Tables vs Native Tables 📦
&lt;/h3&gt;

&lt;p&gt;You have two ways to work with data in BigQuery:&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 1: External Tables (Data stays in GCS)
&lt;/h4&gt;

&lt;p&gt;Your data remains in Google Cloud Storage; BigQuery just reads it when you query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create external table pointing to files in GCS bucket&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`my-project.my_dataset.taxi_external`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'PARQUET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;uris&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'gs://my-bucket/taxi_data/*.parquet'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use External Tables:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ You want to save on storage costs (GCS is cheaper than BigQuery storage)&lt;/li&gt;
&lt;li&gt;✅ One-time or occasional analysis&lt;/li&gt;
&lt;li&gt;✅ Data is updated frequently in source system&lt;/li&gt;
&lt;li&gt;✅ Quick exploration before committing to load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Downsides:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Slower queries (data needs to be read from GCS each time)&lt;/li&gt;
&lt;li&gt;❌ No cost estimation before running queries&lt;/li&gt;
&lt;li&gt;❌ Can't partition or cluster (limited optimization)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Option 2: Native Tables (Data loaded into BigQuery)
&lt;/h4&gt;

&lt;p&gt;Data is copied into BigQuery's own storage (Colossus).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create native table from external table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`my-project.my_dataset.taxi_native`&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`my-project.my_dataset.taxi_external`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use Native Tables:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Frequently queried data&lt;/li&gt;
&lt;li&gt;✅ Need best query performance&lt;/li&gt;
&lt;li&gt;✅ Want to use partitioning and clustering&lt;/li&gt;
&lt;li&gt;✅ Need accurate cost estimates before running queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Downsides:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Higher storage costs&lt;/li&gt;
&lt;li&gt;❌ Data duplication (exists in both GCS and BigQuery)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 &lt;strong&gt;Pro tip:&lt;/strong&gt; Start with external tables for exploration, then load into native tables once you know what data you actually need!&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding BigQuery Costs 💰
&lt;/h3&gt;

&lt;p&gt;BigQuery has two main pricing models:&lt;/p&gt;

&lt;h4&gt;
  
  
  On-Demand Pricing (Pay per query)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$5 per TB&lt;/strong&gt; of data scanned&lt;/li&gt;
&lt;li&gt;Good for: Occasional users, unpredictable workloads&lt;/li&gt;
&lt;li&gt;You pay for how much data your queries read&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Flat-Rate Pricing (Monthly commitment)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~$2,000/month&lt;/strong&gt; for 100 "slots" (compute units)&lt;/li&gt;
&lt;li&gt;Good for: Heavy users, predictable workloads&lt;/li&gt;
&lt;li&gt;Unlimited queries within your slot capacity&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  How to Estimate Query Cost 🧮
&lt;/h4&gt;

&lt;p&gt;Before running a query, BigQuery shows you how much data it will scan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────┐
│  Query Editor                                  │
│  ─────────────────────────────────────────────│
│  SELECT * FROM my_table WHERE date = '2024-01'│
│                                                │
│  [This query will process 2.5 GB when run]    │ ← Check this!
└────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost calculation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2.5 GB = 0.0025 TB&lt;/li&gt;
&lt;li&gt;0.0025 TB × $5 = $0.0125 (about 1 cent)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But if you run that query 100 times a day... costs add up!&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Optimization Tips 💡
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;NEVER use &lt;code&gt;SELECT *&lt;/code&gt;&lt;/strong&gt; unless you absolutely need every column
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="c1"&gt;-- ❌ Bad - reads ALL columns&lt;/span&gt;
   &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;taxi_data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

   &lt;span class="c1"&gt;-- ✅ Good - reads only what you need&lt;/span&gt;
   &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pickup_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dropoff_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fare_amount&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;taxi_data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use partitioned tables&lt;/strong&gt; (covered in Part 3)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Preview before running&lt;/strong&gt; - Always check the estimated bytes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use LIMIT wisely&lt;/strong&gt; - It doesn't reduce data scanned! The filtering happens AFTER reading.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="c1"&gt;-- ❌ Still scans the whole table!&lt;/span&gt;
   &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;huge_table&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

   &lt;span class="c1"&gt;-- ✅ Better - add a WHERE clause first&lt;/span&gt;
   &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;huge_table&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;Cache results&lt;/strong&gt; - BigQuery caches query results for 24 hours (free!)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  BigQuery Caching 🗄️
&lt;/h3&gt;

&lt;p&gt;When you run the same query twice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First run: Scans data, costs money&lt;/li&gt;
&lt;li&gt;Second run: Returns cached result, FREE!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cache is invalidated when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Underlying table data changes&lt;/li&gt;
&lt;li&gt;24 hours pass&lt;/li&gt;
&lt;li&gt;You disable caching in query settings&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  DataEngineeringZoomcamp #BigQuery #DataWarehouse #GCP #SQL #CloudComputing
&lt;/h1&gt;

</description>
      <category>beginners</category>
      <category>database</category>
      <category>googlecloud</category>
      <category>sql</category>
    </item>
    <item>
      <title>Module 3 Summary - Data Warehousing &amp; BigQuery</title>
      <dc:creator>Abdelrahman Adnan</dc:creator>
      <pubDate>Mon, 09 Feb 2026 23:07:36 +0000</pubDate>
      <link>https://dev.to/abdelrahman_adnan/-module-3-summary-data-warehousing-bigquery-2m26</link>
      <guid>https://dev.to/abdelrahman_adnan/-module-3-summary-data-warehousing-bigquery-2m26</guid>
      <description>&lt;h1&gt;
  
  
  DataEngineeringZoomcamp #BigQuery #DataWarehouse #GCP
&lt;/h1&gt;




&lt;h2&gt;
  
  
  Part 1: Understanding Data Warehouses &amp;amp; OLAP vs OLTP 🏢
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Do We Need Data Warehouses? 🤔
&lt;/h3&gt;

&lt;p&gt;Imagine you run an online store. Your website has a database that handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer sign-ups&lt;/li&gt;
&lt;li&gt;Product orders&lt;/li&gt;
&lt;li&gt;Payment processing&lt;/li&gt;
&lt;li&gt;Inventory updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This database needs to be FAST because customers are waiting. Every millisecond counts!&lt;/p&gt;

&lt;p&gt;Now, your boss asks: "What were our top-selling products last year by region, and how did that compare to the year before?"&lt;/p&gt;

&lt;p&gt;Running that query on your production database would:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Slow down your website (bad for customers!)&lt;/li&gt;
&lt;li&gt;Take forever because the database isn't designed for such complex analysis&lt;/li&gt;
&lt;li&gt;Potentially crash things if the query is too heavy&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;This is exactly why data warehouses exist!&lt;/strong&gt; They're a separate place to store your data, specifically designed for answering complex analytical questions without affecting your live applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  OLTP vs OLAP - The Two Worlds of Databases
&lt;/h3&gt;

&lt;p&gt;These acronyms sound scary, but they're simple concepts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OLTP = Online Transaction Processing&lt;/strong&gt; (Your everyday app databases)&lt;br&gt;
&lt;strong&gt;OLAP = Online Analytical Processing&lt;/strong&gt; (Data warehouses for analysis)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;OLTP (Transactional)&lt;/th&gt;
&lt;th&gt;OLAP (Analytical)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it's for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Running your app - orders, logins, updates&lt;/td&gt;
&lt;td&gt;Answering business questions - reports, dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Type of queries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple: "Get user #123's info"&lt;/td&gt;
&lt;td&gt;Complex: "Show sales trends by region for 5 years"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Super fast for small operations&lt;/td&gt;
&lt;td&gt;Can take minutes for huge analyses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data freshness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time, always up-to-date&lt;/td&gt;
&lt;td&gt;Usually updated daily/hourly (batch)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;How data is organized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Normalized (split into many tables, no duplicates)&lt;/td&gt;
&lt;td&gt;Denormalized (fewer tables, some duplication OK)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gigabytes (current data)&lt;/td&gt;
&lt;td&gt;Terabytes/Petabytes (years of history)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Who uses it&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your application, customers&lt;/td&gt;
&lt;td&gt;Data analysts, managers, executives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Examples&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MySQL for your website, PostgreSQL for your app&lt;/td&gt;
&lt;td&gt;BigQuery, Snowflake, Amazon Redshift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;
  
  
  Real-World Example 🛒
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;OLTP scenario (your app database):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- A customer places an order - needs to be FAST&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OLAP scenario (data warehouse):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Your CEO wants to know Q4 performance - can take a minute, that's fine&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_customers&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2023-10-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2023-12-31'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_category&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡 &lt;strong&gt;Key insight:&lt;/strong&gt; OLTP is like a cashier at a store - fast, handles one customer at a time. OLAP is like the accounting department - takes time to analyze all the receipts and produce reports.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Exactly is a Data Warehouse? 🏗️
&lt;/h3&gt;

&lt;p&gt;A data warehouse is a &lt;strong&gt;centralized repository&lt;/strong&gt; where you collect data from ALL your different systems and store it in a way that's optimized for analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think of it like this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine a company with multiple departments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sales team uses Salesforce&lt;/li&gt;
&lt;li&gt;Marketing uses HubSpot
&lt;/li&gt;
&lt;li&gt;Website runs on PostgreSQL&lt;/li&gt;
&lt;li&gt;Inventory managed in SAP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each system has its own database. But your CEO wants a report combining data from ALL of them. This is where a data warehouse comes in!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  Salesforce │   │   HubSpot   │   │  PostgreSQL │
│   (Sales)   │   │ (Marketing) │   │  (Website)  │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       │    ETL/ELT      │                 │
       │   (Extract,     │                 │
       │   Transform,    │                 │
       │    Load)        │                 │
       ▼                 ▼                 ▼
┌──────────────────────────────────────────────────┐
│              DATA WAREHOUSE (BigQuery)            │
│                                                  │
│   All your data, cleaned, organized, ready       │
│   for analysis!                                  │
└──────────────────────────────────────────────────┘
                        │
                        ▼
         ┌──────────────────────────┐
         │  Reports, Dashboards,    │
         │  Machine Learning, etc.  │
         └──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key characteristics of a data warehouse:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📊 &lt;strong&gt;Subject-oriented&lt;/strong&gt; - Organized by business topics (sales, customers, products)&lt;/li&gt;
&lt;li&gt;🔗 &lt;strong&gt;Integrated&lt;/strong&gt; - Data from multiple sources combined together&lt;/li&gt;
&lt;li&gt;📅 &lt;strong&gt;Time-variant&lt;/strong&gt; - Keeps historical data (years worth!)&lt;/li&gt;
&lt;li&gt;🔒 &lt;strong&gt;Non-volatile&lt;/strong&gt; - Data doesn't change once loaded (it's a historical record)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Modern Cloud Data Warehouses ☁️
&lt;/h3&gt;

&lt;p&gt;Traditional data warehouses (like Oracle, Teradata) required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buying expensive hardware&lt;/li&gt;
&lt;li&gt;Hiring DBAs to manage servers&lt;/li&gt;
&lt;li&gt;Months of setup time&lt;/li&gt;
&lt;li&gt;Huge upfront costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Modern cloud data warehouses&lt;/strong&gt; (BigQuery, Snowflake, Redshift) changed everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ No servers to manage (serverless)&lt;/li&gt;
&lt;li&gt;✅ Pay only for what you use&lt;/li&gt;
&lt;li&gt;✅ Scales automatically&lt;/li&gt;
&lt;li&gt;✅ Set up in minutes&lt;/li&gt;
&lt;li&gt;✅ Access from anywhere&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>analytics</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>Module 2 Summary - Workflow Orchestration with Kestra Part 3</title>
      <dc:creator>Abdelrahman Adnan</dc:creator>
      <pubDate>Tue, 03 Feb 2026 01:02:10 +0000</pubDate>
      <link>https://dev.to/abdelrahman_adnan/module-2-summary-workflow-orchestration-with-kestra-part-3-4nn8</link>
      <guid>https://dev.to/abdelrahman_adnan/module-2-summary-workflow-orchestration-with-kestra-part-3-4nn8</guid>
      <description>&lt;h2&gt;
  
  
  Part 3: AI Integration &amp;amp; Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Using AI for Data Engineering
&lt;/h3&gt;

&lt;p&gt;AI tools help data engineers by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generating workflows faster&lt;/strong&gt; - Describe tasks in natural language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoiding errors&lt;/strong&gt; - Get syntax-correct code following best practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; AI is only as good as the context you provide.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Engineering with LLMs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Generic AI assistants (like ChatGPT without context) may produce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outdated plugin syntax&lt;/li&gt;
&lt;li&gt;Incorrect property names&lt;/li&gt;
&lt;li&gt;Hallucinated features that don't exist&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; LLMs are trained on data up to a knowledge cutoff date and don't know about software updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Provide proper context to AI!&lt;/p&gt;

&lt;h3&gt;
  
  
  Kestra AI Copilot
&lt;/h3&gt;

&lt;p&gt;Kestra's built-in AI Copilot is designed specifically for generating Kestra flows with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full context about latest plugins&lt;/li&gt;
&lt;li&gt;Correct workflow syntax&lt;/li&gt;
&lt;li&gt;Current best practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Setup Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get Gemini API key from Google AI Studio&lt;/li&gt;
&lt;li&gt;Configure in docker-compose.yml with &lt;code&gt;GEMINI_API_KEY&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Access via sparkle icon (✨) in Kestra UI&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Retrieval Augmented Generation (RAG)
&lt;/h3&gt;

&lt;p&gt;RAG is a technique that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retrieves&lt;/strong&gt; relevant information from data sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Augments&lt;/strong&gt; the AI prompt with this context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generates&lt;/strong&gt; responses grounded in real data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;RAG Process in Kestra:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ingest documents (documentation, release notes)&lt;/li&gt;
&lt;li&gt;Create embeddings (vector representations)&lt;/li&gt;
&lt;li&gt;Store embeddings in KV Store or vector database&lt;/li&gt;
&lt;li&gt;Query with context at runtime&lt;/li&gt;
&lt;li&gt;Generate accurate, context-aware responses&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;RAG Best Practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep documents updated regularly&lt;/li&gt;
&lt;li&gt;Chunk large documents appropriately&lt;/li&gt;
&lt;li&gt;Test retrieval quality&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deployment &amp;amp; Production
&lt;/h3&gt;

&lt;p&gt;For production deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy Kestra on Google Cloud&lt;/li&gt;
&lt;li&gt;Sync workflows from Git repository&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;Secrets&lt;/strong&gt; and &lt;strong&gt;KV Store&lt;/strong&gt; for sensitive data&lt;/li&gt;
&lt;li&gt;Never commit API keys to Git&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Troubleshooting Tips
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Port conflict with pgAdmin&lt;/td&gt;
&lt;td&gt;Change Kestra port to 18080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CSV column mismatch in BigQuery&lt;/td&gt;
&lt;td&gt;Rerun entire execution including re-download&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container issues&lt;/td&gt;
&lt;td&gt;Stop, remove, and restart containers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Recommended Docker Images:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kestra/kestra:v1.1&lt;/code&gt; (stable version)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;postgres:18&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Additional Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kestra.io/docs" rel="noopener noreferrer"&gt;Kestra Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kestra.io/blueprints" rel="noopener noreferrer"&gt;Blueprints Library&lt;/a&gt; - Pre-built workflow examples&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kestra.io/plugins" rel="noopener noreferrer"&gt;600+ Plugins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://kestra.io/slack" rel="noopener noreferrer"&gt;Kestra Slack Community&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Workflow orchestration&lt;/strong&gt; is essential for managing complex data pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kestra&lt;/strong&gt; provides a flexible, scalable solution with YAML-based flows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ETL&lt;/strong&gt; is ideal for local processing; &lt;strong&gt;ELT&lt;/strong&gt; leverages cloud computing power&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduling and backfills&lt;/strong&gt; enable automated and historical data processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Copilot&lt;/strong&gt; accelerates workflow development with proper context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG&lt;/strong&gt; eliminates AI hallucinations by grounding responses in real data
#dezoomcamp&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>dataengineering</category>
      <category>llm</category>
    </item>
    <item>
      <title>Module 2 Summary - Workflow Orchestration with Kestra Part 2</title>
      <dc:creator>Abdelrahman Adnan</dc:creator>
      <pubDate>Tue, 03 Feb 2026 01:01:18 +0000</pubDate>
      <link>https://dev.to/abdelrahman_adnan/module-2-summary-workflow-orchestration-with-kestra-part-2-3l07</link>
      <guid>https://dev.to/abdelrahman_adnan/module-2-summary-workflow-orchestration-with-kestra-part-2-3l07</guid>
      <description>&lt;h2&gt;
  
  
  Part 2: Building ETL &amp;amp; ELT Data Pipelines
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ETL Pipeline (Local Postgres)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL = Extract → Transform → Load&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The local pipeline workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; CSV data from GitHub (partitioned by year and month)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt; data using Python&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt; data into PostgreSQL database&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Key steps in the flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create tables&lt;/li&gt;
&lt;li&gt;Load data to monthly staging table&lt;/li&gt;
&lt;li&gt;Merge data to final destination table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dataset Source:&lt;/strong&gt; NYC Taxi and Limousine Commission (TLC) Trip Record Data available in CSV format from the DataTalksClub GitHub repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scheduling and Backfills
&lt;/h3&gt;

&lt;p&gt;Kestra provides powerful scheduling capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schedule Trigger&lt;/strong&gt; - Run pipelines at specific times (e.g., daily at 9 AM UTC)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backfill&lt;/strong&gt; - Process historical data by running workflows for past dates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: Backfill green taxi data for year 2019.&lt;/p&gt;

&lt;h3&gt;
  
  
  ELT Pipeline (Google Cloud Platform)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ELT = Extract → Load → Transform&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When working with large datasets in the cloud, ELT is often preferred:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extract&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Get dataset from source (GitHub)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Load&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Upload to data lake (Google Cloud Storage)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create tables in data warehouse (BigQuery) using data from GCS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Advantage:&lt;/strong&gt; Leverage cloud's performance for transforming large datasets much faster than local machines.&lt;/p&gt;

&lt;h3&gt;
  
  
  GCP Setup for Kestra
&lt;/h3&gt;

&lt;p&gt;Required KV Store values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GCP_PROJECT_ID&lt;/code&gt; - Your Google Cloud project&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_LOCATION&lt;/code&gt; - Region for resources&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_NAME&lt;/code&gt; - GCS bucket name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_DATASET&lt;/code&gt; - BigQuery dataset name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_CREDS&lt;/code&gt; - Service account credentials (keep secure!)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GCP Pipeline Flow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Extract CSV from GitHub&lt;/li&gt;
&lt;li&gt;Upload to Google Cloud Storage (data lake)&lt;/li&gt;
&lt;li&gt;Create external table in BigQuery from GCS&lt;/li&gt;
&lt;li&gt;Create partitioned table in BigQuery&lt;/li&gt;
&lt;li&gt;Schedule with timezone support (e.g., &lt;code&gt;America/New_York&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;




&lt;h1&gt;
  
  
  dezoomcamp
&lt;/h1&gt;

</description>
      <category>automation</category>
      <category>dataengineering</category>
      <category>postgres</category>
      <category>python</category>
    </item>
    <item>
      <title>Module 2 Summary - Workflow Orchestration with Kestra Part 1</title>
      <dc:creator>Abdelrahman Adnan</dc:creator>
      <pubDate>Tue, 03 Feb 2026 01:00:26 +0000</pubDate>
      <link>https://dev.to/abdelrahman_adnan/module-2-summary-workflow-orchestration-with-kestra-part-1-4ah0</link>
      <guid>https://dev.to/abdelrahman_adnan/module-2-summary-workflow-orchestration-with-kestra-part-1-4ah0</guid>
      <description>&lt;h2&gt;
  
  
  Part 1: Introduction to Workflow Orchestration &amp;amp; Kestra Fundamentals
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Workflow Orchestration?
&lt;/h3&gt;

&lt;p&gt;Think of a music orchestra with various instruments that need to work together. The conductor helps them play in harmony. Similarly, a &lt;strong&gt;workflow orchestrator&lt;/strong&gt; coordinates multiple tools and platforms to work together.&lt;/p&gt;

&lt;p&gt;A workflow orchestrator typically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runs workflows&lt;/strong&gt; containing predefined steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitors and logs errors&lt;/strong&gt; with additional handling when they occur&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatically triggers workflows&lt;/strong&gt; based on schedules or events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In data engineering, we often need to move data from one place to another with modifications. The orchestrator manages these steps while providing visibility into the process.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Kestra?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Kestra&lt;/strong&gt; is an open-source, event-driven, infinitely-scalable orchestration platform. Key features include:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flow Code (YAML)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Build workflows with code, no-code, or AI Copilot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1000+ Plugins&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Integrate with virtually any tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-language Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use Python, SQL, or any programming language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flexible Triggers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Schedule-based or event-based execution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Core Kestra Concepts
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Flow&lt;/strong&gt; - A container for tasks and orchestration logic (defined in YAML)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt; - Individual steps within a flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inputs&lt;/strong&gt; - Dynamic values passed at runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outputs&lt;/strong&gt; - Data passed between tasks and flows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Triggers&lt;/strong&gt; - Mechanisms that automatically start flow execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt; - A single run of a flow with a specific state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variables&lt;/strong&gt; - Key-value pairs for reusable values across tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugin Defaults&lt;/strong&gt; - Default values applied to tasks of a given type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency&lt;/strong&gt; - Control how many executions can run simultaneously&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Installing Kestra
&lt;/h3&gt;

&lt;p&gt;Kestra runs via Docker Compose with two main services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kestra server container&lt;/li&gt;
&lt;li&gt;PostgreSQL database container
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;02-workflow-orchestration
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Access the UI at: &lt;code&gt;http://localhost:8080&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running Python Code in Kestra
&lt;/h3&gt;

&lt;p&gt;Kestra can execute Python code either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From a dedicated file&lt;/li&gt;
&lt;li&gt;Written directly inside the workflow YAML&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows you to pick the right tools for your pipelines without limitations.&lt;/p&gt;




&lt;h1&gt;
  
  
  dezoomcamp
&lt;/h1&gt;

</description>
      <category>automation</category>
      <category>dataengineering</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Data Engineering ZoomCamp Module 1 Notes Part 1</title>
      <dc:creator>Abdelrahman Adnan</dc:creator>
      <pubDate>Tue, 27 Jan 2026 01:22:08 +0000</pubDate>
      <link>https://dev.to/abdelrahman_adnan/data-engineering-zoomcamp-module-1-notes-part-1-12h3</link>
      <guid>https://dev.to/abdelrahman_adnan/data-engineering-zoomcamp-module-1-notes-part-1-12h3</guid>
      <description>&lt;h1&gt;
  
  
  Module 1: Docker, SQL &amp;amp; Terraform
&lt;/h1&gt;

&lt;p&gt;This is my notes and walkthrough for Module 1 of the Data Engineering Zoomcamp. If you're new to data engineering, this should help you understand the basics.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Data Engineering?
&lt;/h2&gt;

&lt;p&gt;Data Engineering is basically about building systems that collect, store, and analyze data at scale. Think of it as the plumbing that makes data flow from point A to point B so analysts and data scientists can do their thing.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;data pipeline&lt;/strong&gt; is just a service that takes data in, does something with it, and outputs more data. Simple example: read a CSV file, clean it up, store it in a database.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: Docker Basics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Docker?
&lt;/h3&gt;

&lt;p&gt;Docker lets you package your application and all its dependencies into a "container". This solves the classic "it works on my machine" problem.&lt;/p&gt;

&lt;p&gt;Main benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility&lt;/strong&gt; - same environment everywhere&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation&lt;/strong&gt; - apps run independently, won't mess with your system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portability&lt;/strong&gt; - works on any machine with Docker installed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Containers are different from virtual machines - they're much lighter because they share the host OS kernel.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started with Docker
&lt;/h3&gt;

&lt;p&gt;First, check if Docker is installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run your first container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run hello-world
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Try running Ubuntu:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; ubuntu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-it&lt;/code&gt; flag means interactive mode with a terminal. Without it, the container just starts and exits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Important: Containers are Stateless
&lt;/h3&gt;

&lt;p&gt;This tripped me up at first. Any changes you make inside a container are &lt;strong&gt;not carried over&lt;/strong&gt; to the next run — each &lt;code&gt;docker run&lt;/code&gt; starts a brand-new container from the image. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; ubuntu
apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt &lt;span class="nb"&gt;install &lt;/span&gt;python3
&lt;span class="nb"&gt;exit&lt;/span&gt;
&lt;span class="c"&gt;# Run it again&lt;/span&gt;
docker run &lt;span class="nt"&gt;-it&lt;/span&gt; ubuntu
python3  &lt;span class="c"&gt;# Error! Python is not installed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is actually a feature, not a bug. It means you can always start fresh.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managing Containers
&lt;/h3&gt;

&lt;p&gt;See all containers (including stopped ones):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps &lt;span class="nt"&gt;-a&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean up old containers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;docker ps &lt;span class="nt"&gt;-aq&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Better approach - use &lt;code&gt;--rm&lt;/code&gt; to auto-delete when container stops:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; ubuntu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using Different Base Images
&lt;/h3&gt;

&lt;p&gt;You can use pre-built images with software already installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Python image - starts Python interpreter&lt;/span&gt;
docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; python:3.13

&lt;span class="c"&gt;# If you want bash instead of Python:&lt;/span&gt;
docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--entrypoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bash python:3.13-slim
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Volumes - Persisting Data
&lt;/h3&gt;

&lt;p&gt;Since containers are stateless, we need volumes to save data. There are two types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Named volumes&lt;/strong&gt; (Docker manages them):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; my_data:/app/data ubuntu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bind mounts&lt;/strong&gt; (map to a folder on your computer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/my_folder:/app/data ubuntu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 2: Creating a Dockerfile
&lt;/h2&gt;

&lt;p&gt;A Dockerfile is a recipe for building your own Docker image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple Example
&lt;/h3&gt;

&lt;p&gt;Create a file called &lt;code&gt;pipeline.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;day&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Job finished for day = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a &lt;code&gt;Dockerfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.13-slim&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pandas pyarrow

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; pipeline.py pipeline.py&lt;/span&gt;

&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["python", "pipeline.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What each line does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FROM&lt;/code&gt; - base image to build on&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RUN&lt;/code&gt; - execute commands during build&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WORKDIR&lt;/code&gt; - set the working directory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;COPY&lt;/code&gt; - copy files from your machine to the image&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ENTRYPOINT&lt;/code&gt; - the command that runs when container starts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;:pandas &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;:pandas some_argument
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 3: Running PostgreSQL with Docker
&lt;/h2&gt;

&lt;p&gt;Now let's do some real data engineering. We'll run Postgres in a container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"root"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"root"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ny_taxi"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ny_taxi_postgres_data:/var/lib/postgresql/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 5432:5432 &lt;span class="se"&gt;\&lt;/span&gt;
  postgres:17
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breaking this down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-e&lt;/code&gt; sets environment variables (username, password, database name)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-v&lt;/code&gt; creates a named volume so data persists&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-p 5432:5432&lt;/code&gt; maps the container port to your machine&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Connecting to Postgres
&lt;/h3&gt;

&lt;p&gt;Install pgcli (a nice command-line client):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pgcli
&lt;span class="c"&gt;# or with uv:&lt;/span&gt;
uv add &lt;span class="nt"&gt;--dev&lt;/span&gt; pgcli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgcli &lt;span class="nt"&gt;-h&lt;/span&gt; localhost &lt;span class="nt"&gt;-p&lt;/span&gt; 5432 &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-d&lt;/span&gt; ny_taxi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Try some SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;                              &lt;span class="c1"&gt;-- list tables&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;                               &lt;span class="c1"&gt;-- quit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>beginners</category>
      <category>dataengineering</category>
      <category>docker</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Data Engineering ZoomCamp Module 1 Notes Part 2</title>
      <dc:creator>Abdelrahman Adnan</dc:creator>
      <pubDate>Tue, 27 Jan 2026 01:20:18 +0000</pubDate>
      <link>https://dev.to/abdelrahman_adnan/data-engineering-zoomcamp-module-1-notes-part-2-5871</link>
      <guid>https://dev.to/abdelrahman_adnan/data-engineering-zoomcamp-module-1-notes-part-2-5871</guid>
      <description>&lt;h2&gt;
  
  
  Part 4: Data Ingestion with Python
&lt;/h2&gt;

&lt;p&gt;We're going to load the NYC Taxi dataset into Postgres.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up
&lt;/h3&gt;

&lt;p&gt;Install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pandas sqlalchemy psycopg2-binary jupyter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with uv:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add pandas sqlalchemy psycopg2-binary
uv add &lt;span class="nt"&gt;--dev&lt;/span&gt; jupyter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Dataset
&lt;/h3&gt;

&lt;p&gt;We use the NYC Taxi trip data. Download it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Loading Data into Postgres
&lt;/h3&gt;

&lt;p&gt;Here's the basic approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;

&lt;span class="c1"&gt;# Create connection
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postgresql://root:root@localhost:5432/ny_taxi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read CSV in chunks (it's a big file)
&lt;/span&gt;&lt;span class="n"&gt;df_iter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yellow_tripdata_2021-01.csv.gz&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                       &lt;span class="n"&gt;iterator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                       &lt;span class="n"&gt;chunksize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create table from first chunk
&lt;/span&gt;&lt;span class="n"&gt;first_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_iter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;first_chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yellow_taxi_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Insert first chunk
&lt;/span&gt;&lt;span class="n"&gt;first_chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yellow_taxi_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Insert remaining chunks
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_iter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yellow_taxi_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Inserted &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key things here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;chunksize&lt;/code&gt; prevents loading the whole file into memory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;if_exists='replace'&lt;/code&gt; creates the table (first time)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;if_exists='append'&lt;/code&gt; adds rows (subsequent chunks)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Part 5: Docker Compose
&lt;/h2&gt;

&lt;p&gt;Running multiple &lt;code&gt;docker run&lt;/code&gt; commands is annoying. Docker Compose lets you define everything in one file.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;docker-compose.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pgdatabase&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:17&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root"&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root"&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ny_taxi"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ny_taxi_postgres_data:/var/lib/postgresql/data"&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432:5432"&lt;/span&gt;

  &lt;span class="na"&gt;pgadmin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dpage/pgadmin4&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;PGADMIN_DEFAULT_EMAIL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin@admin.com"&lt;/span&gt;
      &lt;span class="na"&gt;PGADMIN_DEFAULT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pgadmin_data:/var/lib/pgadmin"&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080:80"&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ny_taxi_postgres_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pgadmin_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up      &lt;span class="c"&gt;# start everything&lt;/span&gt;
docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;   &lt;span class="c"&gt;# start in background&lt;/span&gt;
docker-compose down    &lt;span class="c"&gt;# stop everything&lt;/span&gt;
docker-compose down &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="c"&gt;# stop and remove volumes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker Compose automatically creates a network so containers can talk to each other using their service names (e.g., &lt;code&gt;pgdatabase&lt;/code&gt; instead of &lt;code&gt;localhost&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Connecting to Postgres from pgAdmin
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;code&gt;http://localhost:8080&lt;/code&gt; in browser&lt;/li&gt;
&lt;li&gt;Login with the email/password from docker-compose&lt;/li&gt;
&lt;li&gt;Right-click Servers &amp;gt; Create &amp;gt; Server&lt;/li&gt;
&lt;li&gt;Name it whatever you want&lt;/li&gt;
&lt;li&gt;Under Connection tab:

&lt;ul&gt;
&lt;li&gt;Host: &lt;code&gt;pgdatabase&lt;/code&gt; (the service name, not localhost!)&lt;/li&gt;
&lt;li&gt;Port: &lt;code&gt;5432&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Username: &lt;code&gt;root&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Password: &lt;code&gt;root&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Part 6: SQL Refresher
&lt;/h2&gt;

&lt;p&gt;Quick review of SQL queries we'll use a lot.&lt;/p&gt;

&lt;h3&gt;
  
  
  JOINs
&lt;/h3&gt;

&lt;p&gt;There are two ways to write an INNER JOIN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Implicit join (old style)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"Zone"&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;yellow_taxi_data&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zones&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"PULocationID"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"LocationID"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Explicit join (preferred)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"Zone"&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;yellow_taxi_data&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;zones&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"PULocationID"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"LocationID"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For multiple joins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;zpu&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"Zone"&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;pickup_zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;zdo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"Zone"&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dropoff_zone&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;yellow_taxi_data&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;zones&lt;/span&gt; &lt;span class="n"&gt;zpu&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"PULocationID"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zpu&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"LocationID"&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;zones&lt;/span&gt; &lt;span class="n"&gt;zdo&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"DOLocationID"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zdo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"LocationID"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GROUP BY and Aggregations
&lt;/h3&gt;

&lt;p&gt;Count trips per day:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tpep_dropoff_datetime&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;trip_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;yellow_taxi_data&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tpep_dropoff_datetime&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Multiple aggregations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tpep_dropoff_datetime&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;trips&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;yellow_taxi_data&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trips&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data Quality Checks
&lt;/h3&gt;

&lt;p&gt;Find NULL values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;yellow_taxi_data&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nv"&gt;"PULocationID"&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Find values not in lookup table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;yellow_taxi_data&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nv"&gt;"PULocationID"&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nv"&gt;"LocationID"&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;zones&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 7: Terraform &amp;amp; GCP
&lt;/h2&gt;

&lt;p&gt;Terraform is Infrastructure as Code (IaC). Instead of clicking around in a cloud console, you write config files describing what you want, and Terraform creates it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Terraform?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Version control your infrastructure&lt;/li&gt;
&lt;li&gt;Reproducible environments&lt;/li&gt;
&lt;li&gt;Easy to replicate across dev/staging/production&lt;/li&gt;
&lt;li&gt;Works with AWS, GCP, Azure, and many more&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GCP Setup
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Create a Google Cloud account (free tier gives you $300 credits)&lt;/li&gt;
&lt;li&gt;Create a new project&lt;/li&gt;
&lt;li&gt;Create a service account:

&lt;ul&gt;
&lt;li&gt;Go to IAM &amp;amp; Admin &amp;gt; Service Accounts&lt;/li&gt;
&lt;li&gt;Create new service account&lt;/li&gt;
&lt;li&gt;Give it these roles: Storage Admin, BigQuery Admin&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Download the JSON key file&lt;/li&gt;
&lt;li&gt;Set the environment variable:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your/key.json"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Terraform Basics
&lt;/h3&gt;

&lt;p&gt;Main files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;main.tf&lt;/code&gt; - main configuration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;variables.tf&lt;/code&gt; - variable definitions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basic &lt;code&gt;main.tf&lt;/code&gt; example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;required_providers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;google&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hashicorp/google"&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"5.6.0"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"google"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your-project-id"&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-central1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_storage_bucket"&lt;/span&gt; &lt;span class="s2"&gt;"data_lake"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your-unique-bucket-name"&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"US"&lt;/span&gt;
  &lt;span class="nx"&gt;force_destroy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_bigquery_dataset"&lt;/span&gt; &lt;span class="s2"&gt;"dataset"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;dataset_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"trips_data"&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"US"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Terraform Commands
&lt;/h3&gt;

&lt;p&gt;The workflow is always:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Initialize (download providers)&lt;/span&gt;
terraform init

&lt;span class="c"&gt;# 2. Preview changes&lt;/span&gt;
terraform plan

&lt;span class="c"&gt;# 3. Apply changes&lt;/span&gt;
terraform apply

&lt;span class="c"&gt;# 4. When you're done, destroy resources&lt;/span&gt;
terraform destroy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To auto-approve (skip the confirmation prompt):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform apply &lt;span class="nt"&gt;-auto-approve&lt;/span&gt;
terraform destroy &lt;span class="nt"&gt;-auto-approve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Common Terraform Flags
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-auto-approve&lt;/code&gt; - don't ask for confirmation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-var="name=value"&lt;/code&gt; - pass variables&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-var-file="file.tfvars"&lt;/code&gt; - use a variables file&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Useful Tips
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Docker Cleanup Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Remove all stopped containers&lt;/span&gt;
docker container prune

&lt;span class="c"&gt;# Remove unused images&lt;/span&gt;
docker image prune

&lt;span class="c"&gt;# Remove unused volumes&lt;/span&gt;
docker volume prune

&lt;span class="c"&gt;# Nuclear option - remove everything unused&lt;/span&gt;
docker system prune &lt;span class="nt"&gt;-a&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Checking Ports
&lt;/h3&gt;

&lt;p&gt;If a port is already in use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find what's using port 5432&lt;/span&gt;
lsof &lt;span class="nt"&gt;-i&lt;/span&gt; :5432
&lt;span class="c"&gt;# or&lt;/span&gt;
netstat &lt;span class="nt"&gt;-tulpn&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;5432
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Docker Networking
&lt;/h3&gt;

&lt;p&gt;When containers need to talk to each other:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Docker Compose: use service names as hostnames&lt;/li&gt;
&lt;li&gt;Manual setup: create a network with &lt;code&gt;docker network create&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker network create my_network
docker run &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my_network &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;container1 ...
docker run &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my_network &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;container2 ...
&lt;span class="c"&gt;# container2 can reach container1 using hostname "container1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;What we covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; - containerization for reproducible environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; - relational database running in Docker&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Ingestion&lt;/strong&gt; - loading data with Python/pandas/SQLAlchemy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt; - orchestrating multiple containers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL&lt;/strong&gt; - querying and aggregating data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform&lt;/strong&gt; - infrastructure as code for GCP&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The main takeaway: these tools help you build reproducible, scalable data pipelines. Docker ensures your code runs the same everywhere, and Terraform ensures your infrastructure is consistent and version-controlled.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/" rel="noopener noreferrer"&gt;Docker Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/" rel="noopener noreferrer"&gt;PostgreSQL Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.terraform.io/docs" rel="noopener noreferrer"&gt;Terraform Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp" rel="noopener noreferrer"&gt;Data Engineering Zoomcamp GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>postgres</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Medical RAG Architecture Overview #llmszoomcamp</title>
      <dc:creator>Abdelrahman Adnan</dc:creator>
      <pubDate>Sat, 04 Oct 2025 18:26:50 +0000</pubDate>
      <link>https://dev.to/abdelrahman_adnan/-medical-rag-architecture-overview-llmszoomcamp-7h</link>
      <guid>https://dev.to/abdelrahman_adnan/-medical-rag-architecture-overview-llmszoomcamp-7h</guid>
      <description>&lt;p&gt;This document provides a comprehensive explanation of the Retrieval-Augmented Generation (RAG) system architecture, breaking down each component and showing how they work together to deliver accurate medical information.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. What is RAG?
&lt;/h2&gt;

&lt;p&gt;RAG combines the power of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Information Retrieval&lt;/strong&gt;: Finding relevant documents from a knowledge base&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language Generation&lt;/strong&gt;: Using LLMs to synthesize coherent answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Grounding&lt;/strong&gt;: Ensuring answers are based on retrieved evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. High-Level Architecture Flow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[User Question] 
    ↓
[Hybrid Search: Vector + BM25]
    ↓
[Context Assembly &amp;amp; Prompt Building]
    ↓
[LLM Generation (GPT-4o-mini/GPT-4o)]
    ↓
[Answer Evaluation &amp;amp; Quality Assessment]
    ↓
[Metrics Calculation &amp;amp; Response Packaging]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Detailed Processing Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Query Processing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Clean and normalize the input medical question&lt;/li&gt;
&lt;li&gt;Prepare query for both semantic and lexical search&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Hybrid Retrieval
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector Search&lt;/strong&gt;: Semantic similarity using 384-dimensional embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BM25 Search&lt;/strong&gt;: Keyword-based exact matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RRF Fusion&lt;/strong&gt;: Combines both approaches using Reciprocal Rank Fusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Context Assembly
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Select top-k most relevant medical cases&lt;/li&gt;
&lt;li&gt;Format retrieved documents into structured context&lt;/li&gt;
&lt;li&gt;Apply medical domain-specific scoring enhancements&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Answer Generation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Build specialized medical prompt with retrieved context&lt;/li&gt;
&lt;li&gt;Generate response using OpenAI models with controlled parameters&lt;/li&gt;
&lt;li&gt;Apply medical safety guidelines&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 5: Quality Assurance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Evaluate answer relevance using LLM-as-a-judge&lt;/li&gt;
&lt;li&gt;Calculate confidence scores and metadata&lt;/li&gt;
&lt;li&gt;Track performance metrics and costs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Core System Components
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;File Location&lt;/th&gt;
&lt;th&gt;Primary Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG Orchestrator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src/core/rag.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Main pipeline coordination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector Database&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src/database/vector_db.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hybrid search + RRF fusion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Ingestion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;scripts/ingest.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Document processing &amp;amp; indexing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src/api/main_api.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;REST endpoints &amp;amp; async processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Web Interface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src/api/web_interface.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive Streamlit UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src/services/s3_service.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Logging &amp;amp; metrics collection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  5. Advanced Search Mechanism
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hybrid Search Strategy
&lt;/h3&gt;

&lt;p&gt;Our system implements a sophisticated hybrid approach that combines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Vector Search&lt;/strong&gt; (Cosine Similarity)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="c1"&gt;# src/core/rag.py
&lt;/span&gt;   &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
       &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search medical knowledge base using hybrid search&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;hybrid_query_rrf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;BM25 Keyword Search&lt;/strong&gt; (Exact Token Matching)

&lt;ul&gt;
&lt;li&gt;Handles medical terminology and acronyms&lt;/li&gt;
&lt;li&gt;Captures exact drug names and dosages&lt;/li&gt;
&lt;li&gt;Preserves clinical precision&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Reciprocal Rank Fusion (RRF) Algorithm
&lt;/h3&gt;

&lt;p&gt;RRF combines multiple ranking approaches using the formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RRF_score = Σ(1 / (k + rank_i))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;k=60&lt;/code&gt; (tuning parameter) and &lt;code&gt;rank_i&lt;/code&gt; is the position in each ranking list.&lt;/p&gt;

&lt;h3&gt;
  
  
  Medical Domain Scoring Enhancements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Severity Weighting&lt;/strong&gt;: Life-threatening conditions get priority&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Department Relevance&lt;/strong&gt;: Matches medical specialties&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symptom Alignment&lt;/strong&gt;: Boosts exact symptom matches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treatment Precision&lt;/strong&gt;: Enhances therapeutic recommendations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. Medical Prompt Engineering
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Structured Prompt Architecture
&lt;/h3&gt;

&lt;p&gt;Our prompts are carefully designed for medical accuracy and safety:&lt;/p&gt;

&lt;h4&gt;
  
  
  System Instruction Design
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Role Definition&lt;/strong&gt;: "You are a knowledgeable medical assistant"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence Constraint&lt;/strong&gt;: "Answer based solely on provided CONTEXT"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Factual Grounding&lt;/strong&gt;: "Use only facts from the CONTEXT"&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Context Formatting Strategy
&lt;/h4&gt;

&lt;p&gt;Each retrieved medical case follows a structured template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/core/rag.py
&lt;/span&gt;&lt;span class="n"&gt;PROMPT_TEMPLATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a knowledgeable medical assistant. Answer the QUESTION based solely on the information provided in the CONTEXT from the medical database.

Use only the facts from the CONTEXT when formulating your answer.

QUESTION: {question}

CONTEXT:
{context}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;ENTRY_TEMPLATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Medical Case:
Question: {question}
Answer: {answer}
Relevance Score: {score:.3f}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/core/rag.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;ENTRY_TEMPLATE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;PROMPT_TEMPLATE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  8. Language Model Integration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model Selection Strategy
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/core/rag.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate response using OpenAI LLM with medical-optimized parameters&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Sufficient for comprehensive medical answers
&lt;/span&gt;        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Low temperature for consistency and accuracy
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Parameter Optimization for Medical Use
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Temperature (0.1)&lt;/strong&gt;: Ensures deterministic, conservative responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Max Tokens (1000)&lt;/strong&gt;: Balances comprehensiveness with cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Choice&lt;/strong&gt;: GPT-4o-mini provides 91.11% relevance vs GPT-4o's 64.75%&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost-Performance Analysis
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Relevance Rate&lt;/th&gt;
&lt;th&gt;Cost per 1K tokens&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;91.11%&lt;/td&gt;
&lt;td&gt;$0.00015 (input)&lt;/td&gt;
&lt;td&gt;Primary model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;64.75%&lt;/td&gt;
&lt;td&gt;$0.03 (input)&lt;/td&gt;
&lt;td&gt;Complex cases only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Response Processing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;token_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completion_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_stats&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  9. Comprehensive Answer Evaluation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LLM-as-a-Judge Methodology
&lt;/h3&gt;

&lt;p&gt;We implement automated quality assessment using a specialized evaluation prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/core/rag.py
&lt;/span&gt;&lt;span class="n"&gt;EVALUATION_PROMPT_TEMPLATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an expert medical reviewer evaluating the quality and relevance of AI-generated medical responses.

You will be given a medical question and a generated answer. Based on the relevance of the generated answer, you will classify it as &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NON_RELEVANT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PARTLY_RELEVANT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, or &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RELEVANT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.

Question: {question}
Generated Answer: {answer}

Provide evaluation in JSON format:
{{
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Relevance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NON_RELEVANT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PARTLY_RELEVANT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RELEVANT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explanation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Brief explanation for evaluation]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
}}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Quality Assessment Metrics
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Relevance Categories&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RELEVANT&lt;/strong&gt;: Direct, accurate medical information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PARTLY_RELEVANT&lt;/strong&gt;: Partially helpful but incomplete&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NON_RELEVANT&lt;/strong&gt;: Off-topic or potentially harmful&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Evaluation Processing&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_relevance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
       &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EVALUATION_PROMPT_TEMPLATE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="n"&gt;evaluation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
           &lt;span class="n"&gt;json_eval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json_eval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;
       &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
           &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Relevance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UNKNOWN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explanation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parse failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Medical Safety Considerations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conservative Evaluation&lt;/strong&gt;: Strict relevance criteria&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explanation Tracking&lt;/strong&gt;: Maintains audit trail for quality decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Handling&lt;/strong&gt;: Graceful degradation for parsing failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual Model Use&lt;/strong&gt;: Separate evaluation model reduces bias&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  10. Cost Optimization &amp;amp; Monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Transparent Cost Calculation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/core/rag.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_openai_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate OpenAI API cost with model-specific pricing&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.00015&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;      &lt;span class="c1"&gt;# Input cost
&lt;/span&gt;            &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completion_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.0006&lt;/span&gt;     &lt;span class="c1"&gt;# Output cost
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;         &lt;span class="c1"&gt;# Higher input cost
&lt;/span&gt;            &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completion_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.06&lt;/span&gt;       &lt;span class="c1"&gt;# Higher output cost
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cost Performance Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Average Cost per Query&lt;/strong&gt;: $0.003&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Average Response Time&lt;/strong&gt;: ~23 seconds per query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Breakdown&lt;/strong&gt;: RAG generation + evaluation costs tracked separately&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  11. Complete Pipeline Integration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Master RAG Function
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/core/rag.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Complete RAG pipeline with comprehensive response data&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Start timing
&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: Hybrid search
&lt;/span&gt;    &lt;span class="n"&gt;search_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Prompt assembly
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: LLM generation
&lt;/span&gt;    &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: Quality evaluation
&lt;/span&gt;    &lt;span class="n"&gt;relevance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rel_token_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_relevance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 5: Metrics calculation
&lt;/span&gt;    &lt;span class="n"&gt;took&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;
    &lt;span class="n"&gt;total_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_openai_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_stats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; \
                &lt;span class="nf"&gt;calculate_openai_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rel_token_stats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 6: Response packaging
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_used&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;took&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relevance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;relevance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Relevance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UNKNOWN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relevance_explanation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;relevance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explanation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_stats&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{...},&lt;/span&gt;  &lt;span class="c1"&gt;# Comprehensive token tracking
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_results_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Top results for audit
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  12. Architecture Design Principles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Medical Safety First
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Evidence-Based&lt;/strong&gt;: All answers must cite retrieved medical literature&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conservative Generation&lt;/strong&gt;: Low temperature prevents hallucination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Gates&lt;/strong&gt;: Multi-step evaluation ensures reliability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit Trails&lt;/strong&gt;: Complete logging for medical compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Performance Optimization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Retrieval&lt;/strong&gt;: Combines semantic understanding with exact matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Selection&lt;/strong&gt;: Cost-effective model choice based on performance data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching Ready&lt;/strong&gt;: Architecture supports future caching implementations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable Design&lt;/strong&gt;: Async processing and background tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Production Considerations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Error Handling&lt;/strong&gt;: Graceful degradation at each step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring Integration&lt;/strong&gt;: Built-in metrics and logging hooks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Control&lt;/strong&gt;: Transparent pricing with usage tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensibility&lt;/strong&gt;: Modular design supports feature additions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  13. Future Enhancement Opportunities
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Immediate Improvements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Citation Generation&lt;/strong&gt;: Add source document references in answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Preprocessing&lt;/strong&gt;: Medical entity recognition and normalization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response Caching&lt;/strong&gt;: Cache frequent queries to reduce costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch Processing&lt;/strong&gt;: Optimize multiple simultaneous queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Advanced Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Modal Support&lt;/strong&gt;: Integrate medical images and charts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialized Models&lt;/strong&gt;: Fine-tuned models for specific medical domains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Learning&lt;/strong&gt;: Incorporate user feedback into model updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clinical Integration&lt;/strong&gt;: EMR system compatibility and FHIR support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Research Directions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval Enhancement&lt;/strong&gt;: Advanced embedding models for medical text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation Improvement&lt;/strong&gt;: Medical-specific language model fine-tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation Evolution&lt;/strong&gt;: Automated medical accuracy assessment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety Advancement&lt;/strong&gt;: Enhanced harm detection and prevention&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>rag</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
