
Abdelrahman Adnan

Part 8 - Staging Load into Postgres 🗃️

This part continues from the Spark transform and explains how the parquet output is loaded into PostgreSQL in dags/staging_load_dag.py.

What this DAG is responsible for

The staging load DAG takes the transformed parquet files and inserts them into the warehouse table that dbt will use as its source.

Its responsibilities are:

  • read the staging parquet for the requested run hour,
  • normalize a few timestamp columns,
  • infer reasonable PostgreSQL column types,
  • create the schema and table if needed,
  • bulk insert the data,
  • and trigger the dbt DAG afterward.
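One of those steps, timestamp normalization, can be sketched as a small pandas helper. This is an illustrative sketch, not the DAG's actual code: the function name and column handling are assumptions.

```python
import pandas as pd

def normalize_timestamps(df: pd.DataFrame, ts_columns: list[str]) -> pd.DataFrame:
    # Parse each timestamp column to timezone-naive UTC datetimes so the
    # values map cleanly onto PostgreSQL TIMESTAMP columns. Unparseable
    # values become NaT instead of failing the load.
    for col in ts_columns:
        if col in df.columns:
            df[col] = (
                pd.to_datetime(df[col], utc=True, errors="coerce")
                .dt.tz_localize(None)
            )
    return df
```

Coercing bad values to NaT keeps a single malformed row from failing the whole batch, which is usually the right trade-off in a staging layer.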

Local and cloud reads

The load code can read data from local parquet paths or from S3 using awswrangler, mirroring the local/cloud split used elsewhere in the project.

This is a good example of how to keep warehouse loading logic environment-agnostic.
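A minimal sketch of that environment-agnostic read, assuming a simple path-prefix check and a lazy awswrangler import (the function names here are illustrative, not the DAG's own):

```python
import pandas as pd

def is_s3_path(path: str) -> bool:
    # S3 locations are addressed with an s3:// URI; anything else is local.
    return path.startswith("s3://")

def read_staging_parquet(path: str) -> pd.DataFrame:
    if is_s3_path(path):
        # awswrangler handles S3 listing and credentials; imported lazily so
        # local-only environments do not need the dependency installed.
        import awswrangler as wr
        return wr.s3.read_parquet(path)
    return pd.read_parquet(path)
```

Keeping the branch behind one reader function means the rest of the load code never has to know which environment it is running in.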

Type preparation

The helper functions in this file convert dataframe columns into PostgreSQL-safe values. The code infers types such as:

  • BOOLEAN,
  • BIGINT,
  • DOUBLE PRECISION,
  • TIMESTAMP,
  • and TEXT.

That keeps the load process flexible without requiring a large manual schema file for the staging table.
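A type-inference helper along these lines could look like the following sketch (the function name is hypothetical, and the real DAG may apply different rules), mapping pandas dtypes onto that small set of PostgreSQL types with TEXT as the fallback:

```python
import pandas as pd
from pandas.api import types as ptypes

def infer_pg_type(series: pd.Series) -> str:
    # Map a pandas dtype onto a small set of PostgreSQL column types,
    # falling back to TEXT when nothing more specific applies.
    if ptypes.is_bool_dtype(series):
        return "BOOLEAN"
    if ptypes.is_integer_dtype(series):
        return "BIGINT"
    if ptypes.is_float_dtype(series):
        return "DOUBLE PRECISION"
    if ptypes.is_datetime64_any_dtype(series):
        return "TIMESTAMP"
    return "TEXT"
```

The check order matters: booleans must be tested before integers, because some backends represent them with integer-like dtypes.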

Bulk insertion

Instead of inserting row by row, the code uses execute_values() from psycopg2, which packs many rows into each statement. That is much faster than per-row inserts and is the right approach for a batch warehouse load.

The target table, stg_air_quality, is created in the airquality_dwh schema.
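A sketch of that bulk-insert pattern follows. The helper names and the page_size value are assumptions for illustration, not the DAG's actual code; only the schema, table, and execute_values() call come from the text above.

```python
def build_insert_sql(schema: str, table: str, columns: list[str]) -> str:
    # execute_values expands the single %s placeholder into many row tuples.
    cols = ", ".join(columns)
    return f"INSERT INTO {schema}.{table} ({cols}) VALUES %s"

def bulk_insert(conn, df, schema="airquality_dwh", table="stg_air_quality"):
    # psycopg2 is imported lazily so build_insert_sql stays usable without
    # a database driver installed.
    from psycopg2.extras import execute_values

    rows = list(df.itertuples(index=False, name=None))
    sql = build_insert_sql(schema, table, list(df.columns))
    with conn.cursor() as cur:
        # page_size controls how many rows go into each generated statement;
        # 1000 is an illustrative choice, not a value from the project.
        execute_values(cur, sql, rows, page_size=1000)
    conn.commit()
```

Building the SQL string separately from the insert call keeps the statement easy to inspect and test without a live connection.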

Why this design works well for the tutorial

This step shows a simple but realistic warehouse loading pattern:

  • transform data into parquet first,
  • bulk load the warehouse from parquet,
  • then let dbt build the analytical models.

That separation is cleaner than trying to do everything inside one SQL script.

Continue

Next, I will explain the dbt project setup, including how the warehouse source is declared and how the model graph is organized.

Continue to Part 9: dbt Project Setup and Contracts.

Tag: #dataengineeringzoomcamp
