Abdelrahman Adnan

Part 2 - Data Sources and Domain Model 📡

This part continues directly from the architecture overview. Now that the overall flow is clear, the next question is what data this pipeline actually uses and how that data is shaped before it enters the warehouse.

The two external sources

The project combines two external APIs:

  • OpenAQ provides air quality station data and pollutant readings.
  • OpenWeather provides weather conditions for the same coordinates.

That pairing matters because air quality is more useful when read alongside weather signals such as temperature, humidity, wind, pressure, and cloud cover.

The API logic lives in dags/air_quality_fetchers.py rather than inside the DAG files themselves. That is a good code organization choice: it keeps the orchestration layer simple and makes the data access layer reusable and testable on its own.
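To make the separation concrete, here is a minimal sketch of what such a fetcher module might look like. The endpoint path, parameter names, and header are assumptions based on the public OpenAQ v3 API, not the project's actual code; check the OpenAQ documentation before relying on them.

```python
# Hypothetical fetcher kept outside the DAG files (assumed API shape).
import json
import urllib.parse
import urllib.request

OPENAQ_BASE = "https://api.openaq.org/v3"  # assumed base URL


def build_measurements_url(sensor_id: int, limit: int = 100) -> str:
    """Pure helper: build the request URL, easy to unit-test offline."""
    query = urllib.parse.urlencode({"limit": limit})
    return f"{OPENAQ_BASE}/sensors/{sensor_id}/measurements?{query}"


def fetch_measurements(sensor_id: int, api_key: str) -> list:
    """Thin HTTP wrapper; the DAG imports this instead of calling the API."""
    req = urllib.request.Request(
        build_measurements_url(sensor_id),
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp).get("results", [])
```

Because URL construction is a pure function, it can be tested without any network access, while the DAG task only needs to call `fetch_measurements`.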

The station sample

The pipeline does not fetch all stations on every run. Instead, it builds a manageable sample of stations for Egypt and rotates through them over time. The sample is created by scripts/build_station_sample.py.
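The rotation itself can be sketched as a simple round-robin over the sampled stations. The batch-size logic below is an assumption about how a scheduled DAG might cycle through stations per run, not the project's actual implementation.

```python
# Hedged sketch: round-robin rotation over a fixed station sample.
def rotate_batch(stations: list, run_index: int, batch_size: int = 5) -> list:
    """Return the batch of stations to fetch on a given scheduled run."""
    start = (run_index * batch_size) % len(stations)
    batch = stations[start:start + batch_size]
    # Wrap around so every run receives a full batch.
    if len(batch) < batch_size:
        batch += stations[: batch_size - len(batch)]
    return batch
```

Each run gets a predictable slice, so over enough runs every station in the sample is refreshed without ever hitting the API for all stations at once.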

That script does three things:

  1. It discovers OpenAQ locations for the configured country.
  2. It enriches each station with country metadata.
  3. It writes the selected station list to a shared JSON file.

The sample acts as the registry that the ingestion DAG reads from. That is why the project can rotate stations in a predictable way while keeping the pipeline small enough for a final project.
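The three steps above can be sketched as a single function. The field names, country filter, and output format here are illustrative assumptions, not the actual contents of scripts/build_station_sample.py.

```python
# Hypothetical sketch of the discover -> enrich -> write flow (assumed fields).
import json
from pathlib import Path


def build_station_sample(locations: list, country_code: str,
                         sample_size: int, out_path: str) -> list:
    # 1. Discover: keep only locations in the configured country.
    in_country = [loc for loc in locations if loc.get("country") == country_code]
    # 2. Enrich: attach country metadata to each selected station.
    sample = [
        {**loc, "country_code": country_code}
        for loc in in_country[:sample_size]
    ]
    # 3. Write the selected station list to a shared JSON file for the DAG.
    Path(out_path).write_text(json.dumps(sample, indent=2))
    return sample
```

Writing the registry to a plain JSON file keeps the bootstrap step decoupled from the DAG: the ingestion code only needs to read one well-known path.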

Why a sample is better than a full crawl here

A full OpenAQ crawl would introduce more complexity than the learning goal needs. Sampling gives you:

  • stable demo behavior,
  • lower API pressure,
  • easier debugging,
  • and clearer reproducibility for a tutorial.

In a real platform, you would likely add a more robust discovery layer, but for a teaching project this is a good tradeoff.

The domain model that emerges

From the raw APIs, the project builds a domain model centered on:

  • station_id,
  • sensor_id,
  • location coordinates,
  • pollutant value,
  • measurement timestamp,
  • and weather context.

These fields are enough to answer questions such as:

  • Which cities are reporting the highest PM2.5?
  • How does weather correlate with the readings?
  • Which stations are active and how fresh is the data?

That domain model is the reason the warehouse later splits into station, sensor, and fact tables.
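As a minimal sketch, the domain record implied by these fields might look like the dataclass below. The class and attribute names are illustrative, not the project's actual schema.

```python
# Illustrative domain record combining a pollutant reading with weather context.
from dataclasses import dataclass


@dataclass
class Measurement:
    station_id: int
    sensor_id: int
    latitude: float
    longitude: float
    pollutant: str        # e.g. "pm25"
    value: float
    measured_at: str      # ISO-8601 measurement timestamp
    temperature_c: float  # weather context joined by coordinates
    humidity_pct: float
```

A flat record like this maps naturally onto the warehouse split: station and sensor attributes become dimension tables, while the reading plus weather context lands in the fact table.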

What to read next

The next part goes inside the station sample builder itself. That script is a good example of how to turn a one-time bootstrap task into a reusable setup step for the rest of the pipeline.

Continue to Part 3: Station Sampling and Cache Building.

Tag: #dataengineeringzoomcamp
