Part 5 - Ingestion DAG and Raw Storage 📥
This part continues from the runtime config and looks at the first real Airflow DAG in the chain: dags/api_ingestion_dag.py.
What the DAG does
The ingestion DAG runs every three minutes. Its job is to:
- load the station sample,
- pick one or more stations for the current interval,
- fetch OpenAQ and OpenWeather payloads,
- save those payloads as raw JSON,
- and trigger the next DAG in the chain.
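The steps above map onto a small DAG skeleton. This is a minimal sketch, not the project's exact code: only `run_ingestion()` and the three-minute cadence come from the source, while the `dag_id`, `start_date`, and retry settings are assumptions for illustration.

```python
# Sketch of the ingestion DAG skeleton. run_ingestion() is the
# project's callable; dag_id, start_date, and default_args here
# are illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_ingestion(**context):
    # Loads the station sample, picks stations for this interval,
    # fetches OpenAQ/OpenWeather payloads, and saves raw JSON.
    ...

with DAG(
    dag_id="api_ingestion_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=3),  # runs every three minutes
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=1)},
) as dag:
    ingest = PythonOperator(
        task_id="run_ingestion",
        python_callable=run_ingestion,
    )
```

Keeping the callable separate from the DAG definition is what keeps the orchestration readable: the DAG file declares *when* and *in what order*, while the ingestion module owns the *how*.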
That is the point where the project stops being a bootstrap script and becomes a scheduled pipeline.
How station rotation works
Instead of hitting all stations every time, the DAG rotates through the sample using the current data interval. That gives the project a simple fairness mechanism:
- different stations are chosen on different runs,
- API usage is spread across the sample,
- and the same DAG can keep running without manual intervention.
This logic is handled in run_ingestion().
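A deterministic rotation can be sketched in a few lines. This is an illustrative reconstruction, not the project's actual `run_ingestion()` body: the function name `pick_stations` and the offset arithmetic are assumptions, but they show the core idea of mapping the data interval to a slice of the sample.

```python
# Illustrative sketch of deterministic station rotation; the
# project's real logic lives inside run_ingestion() and may differ.
from datetime import datetime

def pick_stations(stations, interval_start, batch_size=1):
    """Pick a rotating slice of stations based on the data interval.

    Each 3-minute interval maps to a different offset, so successive
    runs walk through the whole sample without manual bookkeeping.
    """
    if not stations:
        return []
    # Whole 3-minute intervals since midnight select the offset.
    minutes = interval_start.hour * 60 + interval_start.minute
    offset = (minutes // 3 * batch_size) % len(stations)
    return [stations[(offset + i) % len(stations)] for i in range(batch_size)]

stations = ["st-001", "st-002", "st-003", "st-004", "st-005"]
pick_stations(stations, datetime(2024, 1, 1, 0, 0))  # first interval
pick_stations(stations, datetime(2024, 1, 1, 0, 3))  # next interval, next station
```

Because the choice is a pure function of the interval timestamp, a retried or backfilled run picks the same stations as the original run, which keeps the raw zone consistent.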
Raw storage layout
The helper save_to_storage() writes payloads using the same partition logic in both modes:
- local mode writes JSON into local_data/raw/...,
- cloud mode writes JSON into S3 under raw/....
The directory structure is time-partitioned by year, month, day, and hour. That makes it easy for the Spark job to read a specific window later.
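A path builder for that layout could look like the sketch below. The exact directory names and file naming are assumptions; only the year/month/day/hour partitioning and the local vs. S3 prefixes come from the source.

```python
# Sketch of an hour-partitioned raw path, assuming this layout;
# the real save_to_storage() helper may name things differently.
from datetime import datetime

def raw_partition_path(source: str, ts: datetime, base: str = "local_data/raw") -> str:
    """Build an hour-partitioned key, e.g.
    local_data/raw/openaq/2024/01/02/03/openaq_20240102T0304.json

    In cloud mode the same function works with base="raw" to
    produce an S3 object key under the raw/ prefix.
    """
    return (
        f"{base}/{source}/{ts:%Y}/{ts:%m}/{ts:%d}/{ts:%H}/"
        f"{source}_{ts:%Y%m%dT%H%M}.json"
    )
```

Because both modes share one path function, the Spark job can target a specific hourly window with the same glob pattern whether it reads from local disk or S3.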
DAG-to-DAG orchestration
At the end of ingestion, the DAG uses TriggerDagRunOperator to start the transform DAG. That is a useful Airflow pattern because each stage can stay focused on one responsibility while still being chained in order.
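The handoff is a single operator at the end of the ingestion DAG. A minimal sketch, assuming the task id and the downstream DAG's id (the source names neither):

```python
# Sketch of the downstream trigger; task_id and trigger_dag_id
# are assumptions for illustration.
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_transform = TriggerDagRunOperator(
    task_id="trigger_transform_dag",
    trigger_dag_id="transform_dag",  # the transform DAG's dag_id
)

# Chained after the ingestion task so the transform stage starts
# only once raw JSON has been written:
# ingest >> trigger_transform
```

Compared with stuffing transformation into the same DAG, this keeps each file small and lets the two stages be retried, paused, or backfilled independently.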
Why this is a good learning example
This file demonstrates several pipeline ideas in a small space:
- scheduling,
- retry behavior,
- deterministic station rotation,
- raw-zone storage,
- and downstream triggering.
If you are learning Airflow, this is a good pattern to study because it keeps orchestration readable instead of turning the DAG into a giant script.
Continue
The next part zooms in on the API clients themselves so you can see how the project handles retries, normalization, and fallback behavior before data reaches the raw layer.
Continue to Part 6: API Client Design and Reliability.
Tag: #dataengineeringzoomcamp