
Abdelrahman Adnan

Part 1 - Introduction and End-to-End Architecture 🌍

This project was built as the final project for the Data Engineering Zoomcamp. The goal is simple to state but takes real engineering to implement: collect real environmental data for Egyptian cities, process it reliably, store it in a warehouse, and expose it in a dashboard that can be refreshed on a schedule.

This first part gives the big picture. The rest of the series will follow the actual code path step by step, so a reader can understand how each file in the repository fits into the pipeline.

What the project solves

Air quality data is useful only when it is collected consistently and analyzed in context. Raw API payloads are not enough on their own. They need to be:

  • fetched repeatedly from external sources,
  • stored in a predictable raw layout,
  • flattened into structured rows,
  • loaded into a warehouse,
  • modeled for analytics,
  • and presented in a dashboard that business users can read quickly.

That is exactly what this repository does.

The end-to-end flow

The project moves through four main layers:

  1. Ingestion: Airflow fetches OpenAQ and OpenWeather data.
  2. Raw storage: JSON payloads are written to the raw layer.
  3. Transformation: Spark converts nested JSON into partitioned parquet.
  4. Warehouse and analytics: Postgres and dbt create tables for Superset.
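To make the transformation layer concrete, here is a minimal sketch of what "flattening nested JSON into rows" means. The repository does this in Spark; this stand-alone Python version only illustrates the shape of the work, and the field names (results, measurements, parameter, and so on) follow the general OpenAQ payload style rather than the project's actual schema.

```python
# Conceptual sketch of the flatten step: one nested API payload becomes
# flat rows ready for columnar (parquet) storage. The real project does
# this in Spark; field names here are illustrative, not the repo's schema.

def flatten_measurements(payload):
    """Turn one nested OpenAQ-style payload into a list of flat row dicts."""
    rows = []
    for result in payload.get("results", []):
        city = result.get("city")
        coords = result.get("coordinates", {})
        for m in result.get("measurements", []):
            rows.append({
                "city": city,
                "lat": coords.get("latitude"),
                "lon": coords.get("longitude"),
                "parameter": m.get("parameter"),
                "value": m.get("value"),
                "unit": m.get("unit"),
            })
    return rows

# Illustrative payload, not real data from the pipeline.
payload = {
    "results": [
        {
            "city": "Cairo",
            "coordinates": {"latitude": 30.04, "longitude": 31.24},
            "measurements": [
                {"parameter": "pm25", "value": 41.0, "unit": "µg/m³"},
                {"parameter": "pm10", "value": 88.0, "unit": "µg/m³"},
            ],
        }
    ]
}

rows = flatten_measurements(payload)
```

One nested city record with two measurements becomes two flat rows, which is exactly the shape a warehouse table or a partitioned parquet file wants.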

If you read the repository from top to bottom, the flow starts in dags/api_ingestion_dag.py and continues into dags/air_quality_transform_dag.py, dags/staging_load_dag.py, and dags/dbt_models_dag.py.

Why the stack looks like this

The architecture uses tools that mirror a real data engineering platform:

  • Apache Airflow for orchestration and dependency control.
  • Spark for heavy JSON flattening and structured output.
  • PostgreSQL for the warehouse layer.
  • dbt for modeled tables and data quality checks.
  • Superset for the final analytical interface.
  • Docker and Terraform for local and cloud reproducibility.

That combination is important because it shows the difference between a script and a pipeline. A script runs once. A pipeline can be scheduled, monitored, retried, and extended.
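The "retried" part is worth making concrete. Airflow gives every task retry settings out of the box; the stand-alone sketch below only illustrates the idea of retrying a flaky step with a delay, and the function and step names are hypothetical, not from the repository.

```python
# Minimal sketch of one thing an orchestrator adds over a plain script:
# automatic retries around each pipeline step. Airflow expresses this as
# task-level retry settings; this version only demonstrates the concept.
import time

def run_with_retries(step, retries=3, delay=0.01):
    """Run a callable, retrying on failure up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return step()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the failure
            time.sleep(delay)  # back off before retrying

# A stand-in for an API fetch that fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "ok"
```

A plain script dies on the first transient API error; a pipeline step wrapped like this survives it, and an orchestrator also records each attempt so failures are visible.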

Local and cloud execution

A strong design choice in this repository is that the same code can run locally or in AWS. The environment switch happens in dags/pipeline_config.py. When PIPELINE_ENV=local, the project writes to the filesystem and uses local Docker services. When PIPELINE_ENV=cloud, it moves to S3, EMR Serverless, and cloud-managed runtime settings.

That design is a good learning point for a Zoomcamp project because it shows how to keep one codebase portable across development and production environments.

What to focus on next

In the next part, I will explain the data sources themselves: why OpenAQ and OpenWeather were chosen, what each API contributes, and how the station sample becomes the anchor that ties the whole pipeline together.

Continue to Part 2: Data Sources and Domain Model.
