DEV Community

Abdelrahman Adnan
Part 5 - Ingestion DAG and Raw Storage 📥

This part continues from the runtime config and looks at the first real Airflow DAG in the chain: dags/api_ingestion_dag.py.

What the DAG does

The ingestion DAG runs every three minutes. Its job is to:

  1. load the station sample,
  2. pick one or more stations for the current interval,
  3. fetch OpenAQ and OpenWeather payloads,
  4. save those payloads as raw JSON,
  5. and trigger the next DAG in the chain.
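In Airflow terms, that shape can be sketched roughly like this. This is a minimal illustration, not the project's exact code: the `dag_id`, task id, retry count, and the `run_ingestion` stub are assumed names standing in for whatever `dags/api_ingestion_dag.py` actually defines.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_ingestion(**context):
    """Stub: load the station sample, rotate, fetch APIs, save raw JSON."""
    ...


# Hypothetical sketch of the ingestion DAG's schedule and task shape.
with DAG(
    dag_id="api_ingestion",
    schedule="*/3 * * * *",          # every three minutes
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2},     # hedge against transient API failures
) as dag:
    PythonOperator(task_id="run_ingestion", python_callable=run_ingestion)
```

The cron expression `*/3 * * * *` is what gives each run a distinct three-minute data interval, which the rotation logic below keys on.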

That is the point where the project stops being a bootstrap script and becomes a scheduled pipeline.

How station rotation works

Instead of hitting all stations every time, the DAG rotates through the sample using the current data interval. That gives the project a simple fairness mechanism:

  • different stations are chosen on different runs,
  • API usage is spread across the sample,
  • and the same DAG can keep running without manual intervention.

This logic is handled in run_ingestion().
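One way to make that rotation deterministic is to map each three-minute data interval to an index into the station list. The sketch below is an illustration of the idea, not the project's actual `run_ingestion()` internals; the function and parameter names are assumptions.

```python
from datetime import datetime, timezone


def pick_stations(stations, interval_start, per_run=1):
    """Choose a stable slice of stations for a given 3-minute interval.

    The same interval always yields the same stations (safe on retries),
    and consecutive intervals walk through the whole sample.
    """
    if not stations:
        return []
    # Each 180-second interval advances the rotation by one step.
    step = int(interval_start.timestamp()) // 180
    start = (step * per_run) % len(stations)
    # Wrap around the list so every station is eventually visited.
    return [stations[(start + i) % len(stations)] for i in range(per_run)]


stations = ["st-001", "st-002", "st-003"]
t0 = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
print(pick_stations(stations, t0))  # one station; the next run picks the next
```

Because the choice depends only on the interval start, a retried task run picks the same stations as the original attempt, which keeps the raw layer free of surprise duplicates.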

Raw storage layout

The helper save_to_storage() writes payloads using the same partition logic in both modes:

  • local mode writes JSON into local_data/raw/...,
  • cloud mode writes JSON into S3 under raw/....

The directory structure is time-partitioned by year, month, day, and hour. That makes it easy for the Spark job to read a specific window later.
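A path builder following that layout might look like the sketch below. The exact prefixes (`local_data/raw/` vs. `raw/`) come from the article; the function name, filename pattern, and `mode` parameter are illustrative assumptions, since the real `save_to_storage()` is not shown here.

```python
from datetime import datetime, timezone


def raw_path(source: str, ts: datetime, mode: str = "local") -> str:
    """Build a year/month/day/hour partitioned key for a raw JSON payload."""
    partition = ts.strftime("year=%Y/month=%m/day=%d/hour=%H")
    filename = f"{source}_{ts.strftime('%Y%m%dT%H%M%S')}.json"
    # Same partition logic in both modes; only the root prefix differs.
    prefix = "local_data/raw" if mode == "local" else "raw"
    return f"{prefix}/{source}/{partition}/{filename}"


ts = datetime(2024, 1, 1, 14, 3, tzinfo=timezone.utc)
print(raw_path("openaq", ts))
# local_data/raw/openaq/year=2024/month=01/day=01/hour=14/openaq_20240101T140300.json
```

With Hive-style `key=value` partition directories like this, the downstream Spark job can prune to a single hour by path alone instead of scanning the whole raw zone.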

DAG-to-DAG orchestration

At the end of ingestion, the DAG uses TriggerDagRunOperator to start the transform DAG. That is a useful Airflow pattern because each stage can stay focused on one responsibility while still being chained in order.
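The hand-off might be sketched like this. The `dag_id`s and the `conf` payload are assumptions for illustration; only the use of `TriggerDagRunOperator` itself comes from the article.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Hypothetical sketch of the final ingestion task handing off to the
# transform DAG; ids and conf keys are assumed names.
with DAG(
    dag_id="api_ingestion",
    schedule="*/3 * * * *",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    TriggerDagRunOperator(
        task_id="trigger_transform",
        trigger_dag_id="transform_dag",
        # Forward the window so the transform can read exactly the
        # partition this run just wrote.
        conf={"interval_start": "{{ data_interval_start.isoformat() }}"},
    )
```

Passing the data interval through `conf` (rather than having the transform guess "the latest hour") is one way to keep the two DAGs loosely coupled but still aligned on the same window.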

Why this is a good learning example

This file demonstrates several pipeline ideas in a small space:

  • scheduling,
  • retry behavior,
  • deterministic station rotation,
  • raw-zone storage,
  • and downstream triggering.

If you are learning Airflow, this is a good pattern to study because it keeps orchestration readable instead of turning the DAG into a giant script.

Continue

The next part zooms in on the API clients themselves so you can see how the project handles retries, normalization, and fallback behavior before data reaches the raw layer.

Continue to Part 6: API Client Design and Reliability.

Tag: #dataengineeringzoomcamp
