Abdelrahman Adnan

Module 2 Summary - Workflow Orchestration with Kestra (Part 2)

Part 2: Building ETL & ELT Data Pipelines

ETL Pipeline (Local Postgres)

ETL = Extract → Transform → Load

The local pipeline workflow:

  1. Extract CSV data from GitHub (partitioned by year and month)
  2. Transform data using Python
  3. Load data into PostgreSQL database

Key steps in the flow (see the sketch below):

  • Create the destination and staging tables
  • Load the month's data into a staging table
  • Merge the staged data into the final destination table

Dataset Source: NYC Taxi and Limousine Commission (TLC) Trip Record Data available in CSV format from the DataTalksClub GitHub repository.
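
A minimal sketch of such a flow is shown below. Everything in it is illustrative: it assumes newer Kestra task type names (io.kestra.plugin.core.http.Download, io.kestra.plugin.jdbc.postgresql.CopyIn/Queries), uses a placeholder download URL and Postgres credentials, relies on a hypothetical unique_row_id column for deduplication, and leaves out the table-creation and Python transform tasks for brevity.

    id: green_taxi_etl_postgres
    namespace: zoomcamp

    tasks:
      # 1. Extract: download one month of CSV data (placeholder URL)
      - id: extract
        type: io.kestra.plugin.core.http.Download
        uri: https://example.com/green_tripdata_2019-01.csv

      # 2. Load: copy the CSV into a monthly staging table
      - id: load_staging
        type: io.kestra.plugin.jdbc.postgresql.CopyIn
        url: jdbc:postgresql://postgres:5432/ny_taxi
        username: root
        password: root
        format: CSV
        header: true
        table: green_tripdata_staging
        from: "{{ outputs.extract.uri }}"

      # 3. Merge: insert rows from staging that are not yet in the destination
      - id: merge
        type: io.kestra.plugin.jdbc.postgresql.Queries
        url: jdbc:postgresql://postgres:5432/ny_taxi
        username: root
        password: root
        sql: |
          INSERT INTO green_tripdata
          SELECT s.*
          FROM green_tripdata_staging s
          WHERE NOT EXISTS (
            SELECT 1 FROM green_tripdata d
            WHERE d.unique_row_id = s.unique_row_id  -- hypothetical dedup key
          );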

Scheduling and Backfills

Kestra provides powerful scheduling capabilities:

  • Schedule Trigger - Run pipelines at specific times (e.g., daily at 9 AM UTC)
  • Backfill - Process historical data by running workflows for past dates

Example: backfill the green taxi data for the year 2019.
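
As a rough sketch (trigger type name and expressions assumed from recent Kestra versions), a daily schedule plus a date-driven file name could look like this. A backfill started from the Kestra UI on this trigger then creates one execution per past scheduled date, each with its own trigger.date, so old months are processed with the same flow:

    triggers:
      - id: daily_9am
        type: io.kestra.plugin.core.trigger.Schedule
        cron: "0 9 * * *"      # every day at 09:00
        timezone: UTC

    variables:
      # derive the file to process from the (possibly backfilled) execution date
      file: "green_tripdata_{{ trigger.date | date('yyyy-MM') }}.csv"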

ELT Pipeline (Google Cloud Platform)

ELT = Extract → Load → Transform

When working with large datasets in the cloud, ELT is often preferred:

  • Extract - Get the dataset from the source (GitHub)
  • Load - Upload it to the data lake (Google Cloud Storage)
  • Transform - Create tables in the data warehouse (BigQuery) using the data in GCS

Advantage: the cloud's compute can transform large datasets much faster than a local machine could.

GCP Setup for Kestra

Required KV Store values:

  • GCP_PROJECT_ID - Your Google Cloud project ID
  • GCP_LOCATION - Region for resources
  • GCP_BUCKET_NAME - GCS bucket name
  • GCP_DATASET - BigQuery dataset name
  • GCP_CREDS - Service account credentials (keep secure!)
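
Inside a flow, these values can then be read with Kestra's KV-store expression function, roughly like this (the kv() function and the example paths are assumptions based on recent Kestra versions):

    variables:
      # example references to KV-store values; bucket, file, and table names are placeholders
      gcs_path: "gs://{{ kv('GCP_BUCKET_NAME') }}/green_tripdata_2019-01.csv"
      bq_table: "{{ kv('GCP_PROJECT_ID') }}.{{ kv('GCP_DATASET') }}.green_tripdata"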

GCP Pipeline Flow

  1. Extract CSV from GitHub
  2. Upload to Google Cloud Storage (data lake)
  3. Create external table in BigQuery from GCS
  4. Create partitioned table in BigQuery
  5. Schedule with timezone support (e.g., America/New_York)
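
Steps 2-5 might translate into Kestra tasks roughly as follows. This is a sketch, not the course's exact flow: the task type names, the assumed upstream extract task, the file names, and the partition column are illustrative.

    pluginDefaults:
      # apply GCP credentials and location to every GCP task in this flow
      - type: io.kestra.plugin.gcp
        values:
          serviceAccount: "{{ kv('GCP_CREDS') }}"
          projectId: "{{ kv('GCP_PROJECT_ID') }}"
          location: "{{ kv('GCP_LOCATION') }}"

    tasks:
      # 2. Upload the extracted CSV (from an assumed 'extract' download task) to the data lake
      - id: upload_to_gcs
        type: io.kestra.plugin.gcp.gcs.Upload
        from: "{{ outputs.extract.uri }}"
        to: "gs://{{ kv('GCP_BUCKET_NAME') }}/green_tripdata_2019-01.csv"

      # 3. Expose the GCS file to BigQuery as an external table
      #    (no explicit column list; BigQuery infers the schema from the CSV)
      - id: bq_external_table
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE EXTERNAL TABLE `{{ kv('GCP_PROJECT_ID') }}.{{ kv('GCP_DATASET') }}.green_tripdata_ext`
          OPTIONS (
            format = 'CSV',
            uris = ['gs://{{ kv('GCP_BUCKET_NAME') }}/green_tripdata_2019-01.csv'],
            skip_leading_rows = 1
          );

      # 4. Materialize a partitioned table from the external table
      - id: bq_partitioned_table
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE TABLE `{{ kv('GCP_PROJECT_ID') }}.{{ kv('GCP_DATASET') }}.green_tripdata`
          PARTITION BY DATE(lpep_pickup_datetime)
          AS
          SELECT * FROM `{{ kv('GCP_PROJECT_ID') }}.{{ kv('GCP_DATASET') }}.green_tripdata_ext`;

    triggers:
      # 5. Monthly schedule evaluated in the New York timezone
      - id: monthly_schedule
        type: io.kestra.plugin.core.trigger.Schedule
        cron: "0 9 1 * *"      # 09:00 on the 1st of each month
        timezone: America/New_York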

