Part 8 - Staging Load into Postgres 🗃️
This part continues from the Spark transform and explains how the parquet output is loaded into PostgreSQL in dags/staging_load_dag.py.
What this DAG is responsible for
The staging load DAG takes the transformed parquet files and inserts them into the warehouse table that dbt will use as its source.
Its responsibilities are:
- read the staging parquet for the requested run hour,
- normalize a few timestamp columns,
- infer reasonable PostgreSQL column types,
- create the schema and table if needed,
- bulk insert the data,
- and trigger the dbt DAG afterward.
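The "create the schema and table if needed" step can be sketched as a small DDL builder. This is a minimal illustration, not the project's actual code: the function name `build_ddl` and the idea of passing an inferred column-type mapping are assumptions for the example.

```python
def build_ddl(schema: str, table: str, column_types: dict) -> str:
    """Build idempotent DDL for the staging table from inferred column types.

    `column_types` maps column name -> PostgreSQL type, e.g. {"ts": "TIMESTAMP"}.
    (Illustrative helper; the real DAG's function names may differ.)
    """
    cols = ",\n  ".join(f"{name} {pg_type}" for name, pg_type in column_types.items())
    return (
        f"CREATE SCHEMA IF NOT EXISTS {schema};\n"
        f"CREATE TABLE IF NOT EXISTS {schema}.{table} (\n  {cols}\n);"
    )
```

Using `IF NOT EXISTS` on both statements keeps the task safe to re-run, which matters for backfills.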
Local and cloud reads
The load code can read the staging parquet either from local paths or from S3 via awswrangler, mirroring the local/cloud split used elsewhere in the project.
This is a good example of how to keep warehouse loading logic environment-agnostic: the same DAG runs unchanged on a laptop and in the cloud.
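A minimal sketch of that split might look like the following. The root paths, bucket name, and function names here are hypothetical stand-ins, not the project's actual values:

```python
import pandas as pd

LOCAL_ROOT = "/opt/airflow/data/staging"    # hypothetical local path
S3_ROOT = "s3://airquality-bucket/staging"  # hypothetical bucket

def staging_path(run_hour: str, use_s3: bool) -> str:
    """Resolve the parquet location for one run hour, local or S3."""
    root = S3_ROOT if use_s3 else LOCAL_ROOT
    return f"{root}/{run_hour}/"

def read_staging_parquet(run_hour: str, use_s3: bool = False) -> pd.DataFrame:
    """Read the staging parquet, switching backends on the environment flag."""
    path = staging_path(run_hour, use_s3)
    if use_s3:
        import awswrangler as wr  # only needed on the cloud path
        return wr.s3.read_parquet(path=path)
    return pd.read_parquet(path)
```

Keeping the branch inside one reader function means every downstream task stays identical across environments.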
Type preparation
The helper functions in this file convert dataframe columns into PostgreSQL-safe values. The code infers types such as:
- BOOLEAN,
- BIGINT,
- DOUBLE PRECISION,
- TIMESTAMP,
- and TEXT.
That keeps the load process flexible without requiring a large manual schema file for the staging table.
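Type inference along these lines can be done with pandas' dtype checks. This is a sketch under the assumption that the DAG maps pandas dtypes to the five types listed above; the helper name `infer_pg_type` is mine, not the file's:

```python
import pandas as pd

def infer_pg_type(series: pd.Series) -> str:
    """Map a pandas column to one of the PostgreSQL types the staging load uses.

    Order matters: bool is checked before integer because pandas treats
    bool as a numeric dtype.
    """
    if pd.api.types.is_bool_dtype(series):
        return "BOOLEAN"
    if pd.api.types.is_integer_dtype(series):
        return "BIGINT"
    if pd.api.types.is_float_dtype(series):
        return "DOUBLE PRECISION"
    if pd.api.types.is_datetime64_any_dtype(series):
        return "TIMESTAMP"
    return "TEXT"  # safe fallback for strings, mixed, and object columns
```

Falling back to TEXT means an unexpected column never breaks the load; it just lands as a string for dbt to cast later.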
Bulk insertion
Instead of inserting row by row, the code uses execute_values() from psycopg2.extras. That batches many rows into each INSERT statement, which is much faster and is the right approach for a batch warehouse load.
The target table is created in the airquality_dwh schema, and the inserted table is stg_air_quality.
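Put together, the bulk-insert step might look like this sketch. The helper names are hypothetical; only the schema (`airquality_dwh`), table (`stg_air_quality`), and the use of `execute_values` come from the post:

```python
def build_insert_sql(schema: str, table: str, columns: list) -> str:
    """Build the INSERT template that execute_values() fills with row batches."""
    col_list = ", ".join(columns)
    return f"INSERT INTO {schema}.{table} ({col_list}) VALUES %s"

def bulk_load(df, conn, schema="airquality_dwh", table="stg_air_quality"):
    """Bulk-insert a dataframe into the staging table (illustrative sketch)."""
    from psycopg2.extras import execute_values  # imported lazily; needs psycopg2

    rows = [tuple(r) for r in df.itertuples(index=False, name=None)]
    sql = build_insert_sql(schema, table, list(df.columns))
    with conn.cursor() as cur:
        execute_values(cur, sql, rows, page_size=1000)
    conn.commit()
```

`page_size` controls how many rows go into each generated statement; 1000 is a common default trade-off between round trips and statement size.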
Why this design works well for the tutorial
This step shows a simple but realistic warehouse loading pattern:
- transform data into parquet first,
- bulk load the warehouse from parquet,
- then let dbt build the analytical models.
That separation is cleaner than trying to do everything inside one SQL script.
Continue
Next, I will explain the dbt project setup, including how the warehouse source is declared and how the model graph is organized.
Continue to Part 9: dbt Project Setup and Contracts.
Tag: #dataengineeringzoomcamp