Part 6 - API Client Design and Reliability 🔁
This part continues from the ingestion DAG and explains the reusable client functions in dags/air_quality_fetchers.py.
Why the API layer is separated
Keeping API logic out of the DAG file makes the code easier to test and easier to reuse. The DAG can focus on scheduling and control flow while the fetcher module handles HTTP details.
That separation is a small design choice, but it matters when the project grows.
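The testability benefit is concrete: if a fetcher receives its HTTP session as a parameter, a unit test can inject a stub and never touch the network or the DAG. The sketch below assumes this pattern; the function and stub names are illustrative, not the project's actual code.

```python
class StubSession:
    """Stand-in for requests.Session so the fetcher runs without a network."""
    def get(self, url, params=None, timeout=None):
        class Resp:
            status_code = 200
            def raise_for_status(self):
                pass
            def json(self):
                # Mimics the shape of an OpenWeather air_pollution response.
                return {"list": [{"main": {"aqi": 2}}]}
        return Resp()

def fetch_aqi(session, lat, lon, api_key):
    # Hypothetical fetcher: the session is injected, so tests can pass a stub.
    resp = session.get(
        "https://api.openweathermap.org/data/2.5/air_pollution",
        params={"lat": lat, "lon": lon, "appid": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["list"][0]["main"]["aqi"]

print(fetch_aqi(StubSession(), 52.52, 13.40, "fake-key"))
```

The DAG file then only wires scheduling to these callables, while all HTTP details stay in the fetcher module.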
OpenWeather air quality data
The function fetch_openweather_air_quality() queries the OpenWeather air pollution endpoint using the station coordinates. It then reshapes the response into the ingestion format expected by downstream code.
That normalization step is important because the downstream Spark job expects a consistent structure, not a raw vendor payload.
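A normalization step of this kind can be sketched as follows. The input shape matches the OpenWeather air pollution response (`coord` plus a `list` of entries with `dt` and `components`); the output field names and the exact reshaping are assumptions about the project's ingestion format.

```python
from datetime import datetime, timezone

def normalize_air_quality(raw, station_id, sensor_id):
    """Reshape a raw OpenWeather air_pollution payload into flat records."""
    results = []
    for entry in raw.get("list", []):
        # OpenWeather returns epoch seconds; emit an ISO-8601 UTC timestamp.
        ts = datetime.fromtimestamp(entry["dt"], tz=timezone.utc).isoformat()
        for pollutant, value in entry.get("components", {}).items():
            results.append({
                "station_id": station_id,
                "sensor_id": sensor_id,
                "parameter": pollutant,
                "value": value,
                "timestamp": ts,
                "coordinates": raw.get("coord", {}),
            })
    return {"results": results}

raw = {"coord": {"lat": 52.52, "lon": 13.40},
       "list": [{"dt": 1700000000, "components": {"pm2_5": 7.1, "no2": 12.3}}]}
out = normalize_air_quality(raw, station_id="st-1", sensor_id="sn-9")
print(len(out["results"]))  # one flat record per pollutant
```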
Weather fallback behavior
The fetch_weather() function prefers the OpenWeather One Call API, but it falls back to the legacy weather endpoint when the primary request is unauthorized or unavailable.
That is a practical resilience pattern:
- try the richer endpoint first,
- fall back to a simpler endpoint,
- keep the payload shape stable after normalization.
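The three steps above can be sketched as a single function. The endpoint URLs are the public OpenWeather ones; the parameter names and the demo stub are assumptions for illustration.

```python
import requests

ONE_CALL_URL = "https://api.openweathermap.org/data/3.0/onecall"
LEGACY_URL = "https://api.openweathermap.org/data/2.5/weather"

def fetch_weather(session, lat, lon, api_key):
    """Try the richer One Call endpoint first; fall back to the legacy one."""
    params = {"lat": lat, "lon": lon, "appid": api_key, "units": "metric"}
    try:
        resp = session.get(ONE_CALL_URL, params=params, timeout=10)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Primary endpoint unauthorized or unavailable: use the simpler one.
        resp = session.get(LEGACY_URL, params=params, timeout=10)
        resp.raise_for_status()
        return resp.json()

# Demo with a stub session whose One Call endpoint always returns 401,
# so the call transparently falls back to the legacy endpoint.
class _Resp:
    def __init__(self, code, body):
        self.status_code, self._body = code, body
    def raise_for_status(self):
        if self.status_code >= 400:
            raise requests.HTTPError(f"HTTP {self.status_code}")
    def json(self):
        return self._body

class _StubSession:
    def get(self, url, params=None, timeout=None):
        if "onecall" in url:
            return _Resp(401, {})
        return _Resp(200, {"main": {"temp": 3.2}})

weather = fetch_weather(_StubSession(), 52.52, 13.40, "fake-key")
print(weather["main"]["temp"])
```

Note that normalization would happen after either branch, so callers never see which endpoint actually answered.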
Retry strategy
The module also configures a requests session with retry handling for transient failures such as:
- 429 rate limits,
- 500-level server errors,
- and similar temporary issues.
That means the ingestion layer is not just making one-off calls. It is designed to survive short-term API instability.
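A typical way to configure such a session uses urllib3's `Retry` mounted on a `requests` adapter. The retry counts and backoff below are illustrative values, not necessarily the project's exact configuration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session() -> requests.Session:
    # Retry transient failures: rate limits (429) and 5xx server errors.
    retry = Retry(
        total=5,
        backoff_factor=1,  # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

session = make_retrying_session()
policy = session.get_adapter("https://api.openweathermap.org").max_retries
print(policy.total, sorted(policy.status_forcelist))
```

Every fetcher that goes through this session inherits the retry behavior, so the resilience logic lives in one place.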
Why the normalized shape matters
The fetchers emit a payload that contains a results array with station id, sensor id, value, timestamp, and coordinates. That shape is intentionally simple so the Spark job and the raw storage layer can process it with minimal special handling.
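Seen from the consumer side, a payload of that shape might look like the sketch below (field names and values are illustrative). Because every record is flat and predictable, downstream code needs no vendor-specific branching.

```python
# Illustrative example of the normalized payload shape described above.
payload = {
    "results": [
        {
            "station_id": "station-001",
            "sensor_id": "pm25-17",
            "value": 7.1,
            "timestamp": "2023-11-14T22:13:20+00:00",
            "coordinates": {"lat": 52.52, "lon": 13.40},
        }
    ]
}

# Every record can be consumed the same way, regardless of which vendor
# endpoint produced it.
rows = [(r["station_id"], r["value"], r["timestamp"]) for r in payload["results"]]
print(rows)
```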
Lesson from this module
The main lesson is that reliable ingestion is not only about calling an API. It is about shaping the response into something downstream systems can trust.
Continue
The next part moves into the transformation stage and shows how the same data becomes partitioned Parquet through Spark, both locally and on EMR Serverless.
Continue to Part 7: Spark Transform Local vs Cloud.
Tag: #dataengineeringzoomcamp