Part 4 - Airflow Runtime and Shared Config ⚙️

#architecture #automation #dataengineering #python

This part continues from the bootstrap logic and explains the configuration layer that keeps the rest of the codebase portable.

The role of pipeline_config.py

The file dags/pipeline_config.py is the central runtime configuration module. It decides whether the project is running locally or in cloud mode and exposes the paths and credentials the other modules need.

That keeps the design clean: the environment logic lives in one place instead of being repeated in every DAG or script.

Local versus cloud behavior

The first important flag is PIPELINE_ENV. When it is set to local, the project uses:

local filesystem storage,
local parquet directories,
Dockerized Postgres,
and local Spark execution.

When it is set to anything other than local, the same code paths switch to:

S3 for raw and staging data,
AWS region-based clients,
and cloud runtime configuration such as SSM parameters.
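
To make the switch concrete, here is a minimal sketch of how a flag like PIPELINE_ENV could drive both modes. It is not the project's actual code: the function name resolve_runtime, the dictionary keys, the AWS_REGION variable, the fallback region, and the choice to default to local when the flag is unset are all assumptions made for illustration.

```python
import os


def resolve_runtime(env=None):
    """Return a small settings dict based on PIPELINE_ENV (hypothetical helper)."""
    # Accept an explicit mapping so the function is easy to test;
    # fall back to the real process environment otherwise.
    env = os.environ if env is None else env

    # Assumption: an unset flag is treated as local mode.
    pipeline_env = env.get("PIPELINE_ENV", "local").strip().lower()

    if pipeline_env == "local":
        return {
            "is_local": True,
            "storage": "filesystem",        # local raw and parquet directories
            "database": "docker-postgres",  # Dockerized Postgres
            "spark": "local",               # local Spark execution
        }

    return {
        "is_local": False,
        "storage": "s3",                    # S3 for raw and staging data
        # Assumption: region comes from AWS_REGION with a placeholder default.
        "aws_region": env.get("AWS_REGION", "us-east-1"),
        "config_source": "ssm",             # settings read from SSM parameters
    }
```

Calling resolve_runtime({"PIPELINE_ENV": "local"}) returns the filesystem settings, while any other value returns the S3 settings. The point of the pattern is that DAGs and scripts only read the resulting settings and never inspect the environment themselves.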

Paths and partition helpers

The module also creates and manages the local data tree:

raw data under local_data/raw,
staging data under local_data/staging,
configuration under local_data/config,
and logs under local_data/logs.
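
As a rough illustration of creating that tree early, the sketch below builds the four directories with pathlib. The function name ensure_local_dirs and its base_dir parameter are hypothetical; the real module may create the directories differently.

```python
from pathlib import Path


def ensure_local_dirs(base_dir):
    """Create local_data/{raw,staging,config,logs} under base_dir (hypothetical helper)."""
    root = Path(base_dir) / "local_data"
    dirs = {name: root / name for name in ("raw", "staging", "config", "logs")}
    for path in dirs.values():
        # parents=True builds local_data itself on first run;
        # exist_ok=True makes repeated calls harmless.
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

Because the call is idempotent, it is safe to run every time the config module is imported, so later stages can assume the directories exist.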

Two helpers are especially important:

local_raw_path() builds the raw JSON file path by prefix, station, and timestamp.
local_staging_path() builds the parquet partition path in a year/month/day/hour layout.

Those helpers define the physical layout used by both the ingestion and transformation stages.
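
The sketch below shows one way the two helpers could be written. Only the function names and the general layout come from the description above; the signatures, the timestamp format in the file name, the directory nesting for prefix and station, and the Hive-style key=value partition folders are assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path

# Assumed base directories, matching the local data tree described above.
LOCAL_RAW_DIR = Path("local_data") / "raw"
LOCAL_STAGING_DIR = Path("local_data") / "staging"


def local_raw_path(prefix, station, ts):
    """Raw JSON path built from prefix, station, and timestamp (assumed layout)."""
    stamp = ts.strftime("%Y%m%dT%H%M%S")
    return LOCAL_RAW_DIR / prefix / station / f"{stamp}.json"


def local_staging_path(ts):
    """Parquet partition directory in a year/month/day/hour layout (assumed naming)."""
    return (
        LOCAL_STAGING_DIR
        / f"year={ts:%Y}"
        / f"month={ts:%m}"
        / f"day={ts:%d}"
        / f"hour={ts:%H}"
    )


if __name__ == "__main__":
    ts = datetime(2024, 3, 5, 14, 30, 0, tzinfo=timezone.utc)
    print(local_raw_path("weather", "station_01", ts).as_posix())
    print(local_staging_path(ts).as_posix())
```

Keeping both functions in the config module means the ingestion stage and the transformation stage derive paths from the same rules, so a layout change happens in one place.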

Why this module is worth copying in other projects

This file is small, but it is doing real platform work:

it standardizes runtime settings,
it creates expected directories early,
it keeps the path logic consistent,
and it reduces duplication across DAGs and scripts.

If you are building your own project, this is the kind of module that saves you time once the pipeline grows beyond a few files.

Next step

Now that the shared config is clear, the next article explains the ingestion DAG that uses it: how the pipeline fetches station data, stores raw JSON, and triggers the transformation job.

Continue to Part 5: Ingestion DAG and Raw Storage.

Tag: #dataengineeringzoomcamp