Part 2: Building ETL & ELT Data Pipelines
ETL Pipeline (Local Postgres)
ETL = Extract → Transform → Load
The local pipeline workflow:
- Extract CSV data from GitHub (partitioned by year and month)
- Transform data using Python
- Load data into PostgreSQL database
Key steps in the flow:
- Create tables
- Load data into a monthly staging table
- Merge staged data into the final destination table (sketched in code below)
Dataset Source: NYC Taxi and Limousine Commission (TLC) Trip Record Data available in CSV format from the DataTalksClub GitHub repository.
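In the course these steps are orchestrated as a Kestra flow; the block below is only a rough plain-Python sketch of the same extract → load-to-staging → merge logic. The URL pattern, connection string, column names, and table names are illustrative assumptions, not the course's exact values.

```python
"""Minimal ETL sketch: extract a monthly CSV, load it to a staging table,
then merge into the final table in Postgres.

All names and the URL pattern below are placeholders.
"""
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical partition to process and a placeholder source URL pattern.
YEAR, MONTH = 2019, 1
URL = (
    "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/"
    f"green/green_tripdata_{YEAR}-{MONTH:02d}.csv.gz"
)

engine = create_engine("postgresql://user:password@localhost:5432/ny_taxi")

# Extract: pandas reads the gzipped CSV straight from the URL.
df = pd.read_csv(URL, compression="gzip")

# Transform: light cleanup, e.g. parse pickup/dropoff timestamps.
df["lpep_pickup_datetime"] = pd.to_datetime(df["lpep_pickup_datetime"])
df["lpep_dropoff_datetime"] = pd.to_datetime(df["lpep_dropoff_datetime"])

with engine.begin() as conn:
    # Load: replace the monthly staging table with this month's rows.
    df.to_sql("green_tripdata_staging", conn, if_exists="replace", index=False)

    # Merge (simplified): append staging rows into the final table,
    # creating it from the staging schema on the first run.
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS green_tripdata
        AS SELECT * FROM green_tripdata_staging WITH NO DATA;
    """))
    conn.execute(text("""
        INSERT INTO green_tripdata
        SELECT * FROM green_tripdata_staging;
    """))
```

Replacing the staging table on every run keeps the monthly load repeatable; the final table only ever receives rows through the merge step.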
Scheduling and Backfills
Kestra provides powerful scheduling capabilities:
- Schedule Trigger - Run pipelines at specific times (e.g., daily at 9 AM UTC)
- Backfill - Process historical data by running workflows for past dates
Example: Backfill green taxi data for year 2019.
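Kestra performs backfills natively by re-executing a Schedule trigger over a past date range from the UI or API. Purely as a conceptual illustration of what that means, the loop below runs a hypothetical monthly job once per 2019 partition:

```python
def run_month(year: int, month: int) -> None:
    """Hypothetical wrapper around the monthly ETL sketch above."""
    print(f"processing green taxi data for {year}-{month:02d}")

# Backfill 2019: execute the monthly pipeline for every past partition.
for month in range(1, 13):
    run_month(2019, month)
```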
ELT Pipeline (Google Cloud Platform)
ELT = Extract → Load → Transform
When working with large datasets in the cloud, ELT is often preferred:
| Step | Description |
|---|---|
| Extract | Get dataset from source (GitHub) |
| Load | Upload to data lake (Google Cloud Storage) |
| Transform | Create tables in data warehouse (BigQuery) using data from GCS |
Advantage: leverage the cloud's compute to transform large datasets much faster than a local machine could.
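The course handles the Load step with Kestra's GCP tasks; here is a minimal sketch of the same upload using the google-cloud-storage client, assuming placeholder bucket, object, and file names:

```python
"""Load step sketch: upload an extracted CSV to a GCS data lake."""
from google.cloud import storage

BUCKET_NAME = "my-taxi-data-lake"           # placeholder GCS bucket
OBJECT_NAME = "green_tripdata_2019-01.csv"  # placeholder object key
LOCAL_FILE = "green_tripdata_2019-01.csv"   # file extracted from GitHub

client = storage.Client()                   # uses application default credentials
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob(OBJECT_NAME)
blob.upload_from_filename(LOCAL_FILE)
print(f"uploaded to gs://{BUCKET_NAME}/{OBJECT_NAME}")
```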
GCP Setup for Kestra
Required KV Store values:
- `GCP_PROJECT_ID` - Your Google Cloud project ID
- `GCP_LOCATION` - Region for resources
- `GCP_BUCKET_NAME` - GCS bucket name
- `GCP_DATASET` - BigQuery dataset name
- `GCP_CREDS` - Service account credentials (keep secure; see the loading sketch below)
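Inside a flow, Kestra resolves these values from its KV Store at runtime. Outside Kestra, the same configuration might arrive as environment variables; the sketch below builds GCP clients from them (the environment-variable approach and the JSON credential handling are assumptions, not part of the course setup):

```python
import json
import os

from google.cloud import bigquery, storage
from google.oauth2 import service_account

# Read the same values the Kestra flow would pull from its KV Store.
project_id = os.environ["GCP_PROJECT_ID"]
location = os.environ["GCP_LOCATION"]
bucket_name = os.environ["GCP_BUCKET_NAME"]
dataset = os.environ["GCP_DATASET"]

# GCP_CREDS holds the service-account JSON; never commit it to the repo.
creds = service_account.Credentials.from_service_account_info(
    json.loads(os.environ["GCP_CREDS"])
)

gcs = storage.Client(project=project_id, credentials=creds)
bq = bigquery.Client(project=project_id, credentials=creds, location=location)
```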
GCP Pipeline Flow
- Extract CSV from GitHub
- Upload to Google Cloud Storage (data lake)
- Create external table in BigQuery from GCS
- Create partitioned table in BigQuery
- Schedule with timezone support (e.g., `America/New_York`); steps 3 and 4 are sketched below
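Steps 3 and 4 run DDL against BigQuery; a sketch using the google-cloud-bigquery client, with placeholder project, dataset, table, and bucket names:

```python
"""Sketch of the BigQuery steps: an external table over the CSV files
in GCS, then a native table partitioned by pickup date.
"""
from google.cloud import bigquery

client = bigquery.Client()

# Step 3: external table reading the raw CSV directly from the data lake.
client.query("""
    CREATE OR REPLACE EXTERNAL TABLE `my-project.zoomcamp.green_tripdata_ext`
    OPTIONS (
        format = 'CSV',
        uris = ['gs://my-taxi-data-lake/green_tripdata_2019-*.csv']
    );
""").result()

# Step 4: materialize a native table partitioned by pickup date.
client.query("""
    CREATE OR REPLACE TABLE `my-project.zoomcamp.green_tripdata`
    PARTITION BY DATE(lpep_pickup_datetime) AS
    SELECT * FROM `my-project.zoomcamp.green_tripdata_ext`;
""").result()
```

Partitioning the native table by pickup date lets BigQuery prune partitions when queries filter on that column, which is the main payoff of step 4.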