Part 2: Building ETL & ELT Data Pipelines
ETL Pipeline (Local Postgres)
ETL = Extract → Transform → Load
The local pipeline workflow:
- Extract CSV data from GitHub (partitioned by year and month)
- Transform data using Python
- Load data into PostgreSQL database
Key steps in the flow:
- Create tables
- Load data into a monthly staging table
- Merge staged data into the final destination table (sketched in code below)
Dataset Source: NYC Taxi and Limousine Commission (TLC) Trip Record Data available in CSV format from the DataTalksClub GitHub repository.
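In the course these steps are orchestrated as a Kestra flow; the block below is only a rough plain-Python sketch of the same extract → load-to-staging → merge logic. The URL pattern, connection string, column names, and table names are illustrative assumptions, not the course's exact values.

```python
"""Minimal ETL sketch: extract a monthly CSV, load it to a staging table,
then merge into the final table in Postgres.

All names and the URL pattern below are placeholders.
"""
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical partition to process and a placeholder source URL pattern.
YEAR, MONTH = 2019, 1
URL = (
    "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/"
    f"green/green_tripdata_{YEAR}-{MONTH:02d}.csv.gz"
)

engine = create_engine("postgresql://user:password@localhost:5432/ny_taxi")

# Extract: pandas reads the gzipped CSV straight from the URL.
df = pd.read_csv(URL, compression="gzip")

# Transform: light cleanup, e.g. parse pickup/dropoff timestamps.
df["lpep_pickup_datetime"] = pd.to_datetime(df["lpep_pickup_datetime"])
df["lpep_dropoff_datetime"] = pd.to_datetime(df["lpep_dropoff_datetime"])

with engine.begin() as conn:
    # Load: replace the monthly staging table with this month's rows.
    df.to_sql("green_tripdata_staging", conn, if_exists="replace", index=False)

    # Merge (simplified): append staging rows into the final table,
    # creating it from the staging schema on the first run.
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS green_tripdata
        AS SELECT * FROM green_tripdata_staging WITH NO DATA;
    """))
    conn.execute(text("""
        INSERT INTO green_tripdata
        SELECT * FROM green_tripdata_staging;
    """))
```

Replacing the staging table on every run keeps the monthly load repeatable; the final table only ever receives rows through the merge step.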
Scheduling and Backfills
Kestra provides powerful scheduling capabilities:
- Schedule Trigger - Run pipelines at specific times (e.g., daily at 9 AM UTC)
- Backfill - Process historical data by running workflows for past dates
Example: Backfill green taxi data for year 2019.
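Kestra performs backfills natively by re-executing a Schedule trigger over a past date range from the UI or API. Purely as a conceptual illustration of what that means, the loop below runs a hypothetical monthly job once per 2019 partition:

```python
def run_month(year: int, month: int) -> None:
    """Hypothetical wrapper around the monthly ETL sketch above."""
    print(f"processing green taxi data for {year}-{month:02d}")

# Backfill 2019: execute the monthly pipeline for every past partition.
for month in range(1, 13):
    run_month(2019, month)
```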
ELT Pipeline (Google Cloud Platform)
ELT = Extract → Load → Transform
When working with large datasets in the cloud, ELT is often preferred:
| Step | Description |
|---|---|
| Extract | Get dataset from source (GitHub) |
| Load | Upload to data lake (Google Cloud Storage) |
| Transform | Create tables in data warehouse (BigQuery) using data from GCS |
Advantage: leverage the cloud's compute to transform large datasets much faster than a local machine could.
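The course handles the Load step with Kestra's GCP tasks; here is a minimal sketch of the same upload using the google-cloud-storage client, assuming placeholder bucket, object, and file names:

```python
"""Load step sketch: upload an extracted CSV to a GCS data lake."""
from google.cloud import storage

BUCKET_NAME = "my-taxi-data-lake"           # placeholder GCS bucket
OBJECT_NAME = "green_tripdata_2019-01.csv"  # placeholder object key
LOCAL_FILE = "green_tripdata_2019-01.csv"   # file extracted from GitHub

client = storage.Client()                   # uses application default credentials
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob(OBJECT_NAME)
blob.upload_from_filename(LOCAL_FILE)
print(f"uploaded to gs://{BUCKET_NAME}/{OBJECT_NAME}")
```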
GCP Setup for Kestra
Required KV Store values:
- `GCP_PROJECT_ID` - Your Google Cloud project ID
- `GCP_LOCATION` - Region for resources
- `GCP_BUCKET_NAME` - GCS bucket name
- `GCP_DATASET` - BigQuery dataset name
- `GCP_CREDS` - Service account credentials (keep secure; see the loading sketch below)
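Inside a flow, Kestra resolves these values from its KV Store at runtime. Outside Kestra, the same configuration might arrive as environment variables; the sketch below builds GCP clients from them (the environment-variable approach and the JSON credential handling are assumptions, not part of the course setup):

```python
import json
import os

from google.cloud import bigquery, storage
from google.oauth2 import service_account

# Read the same values the Kestra flow would pull from its KV Store.
project_id = os.environ["GCP_PROJECT_ID"]
location = os.environ["GCP_LOCATION"]
bucket_name = os.environ["GCP_BUCKET_NAME"]
dataset = os.environ["GCP_DATASET"]

# GCP_CREDS holds the service-account JSON; never commit it to the repo.
creds = service_account.Credentials.from_service_account_info(
    json.loads(os.environ["GCP_CREDS"])
)

gcs = storage.Client(project=project_id, credentials=creds)
bq = bigquery.Client(project=project_id, credentials=creds, location=location)
```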
GCP Pipeline Flow
- Extract CSV from GitHub
- Upload to Google Cloud Storage (data lake)
- Create external table in BigQuery from GCS
- Create partitioned table in BigQuery
- Schedule with timezone support (e.g., `America/New_York`); steps 3 and 4 are sketched below
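Steps 3 and 4 run DDL against BigQuery; a sketch using the google-cloud-bigquery client, with placeholder project, dataset, table, and bucket names:

```python
"""Sketch of the BigQuery steps: an external table over the CSV files
in GCS, then a native table partitioned by pickup date.
"""
from google.cloud import bigquery

client = bigquery.Client()

# Step 3: external table reading the raw CSV directly from the data lake.
client.query("""
    CREATE OR REPLACE EXTERNAL TABLE `my-project.zoomcamp.green_tripdata_ext`
    OPTIONS (
        format = 'CSV',
        uris = ['gs://my-taxi-data-lake/green_tripdata_2019-*.csv']
    );
""").result()

# Step 4: materialize a native table partitioned by pickup date.
client.query("""
    CREATE OR REPLACE TABLE `my-project.zoomcamp.green_tripdata`
    PARTITION BY DATE(lpep_pickup_datetime) AS
    SELECT * FROM `my-project.zoomcamp.green_tripdata_ext`;
""").result()
```

Partitioning the native table by pickup date lets BigQuery prune partitions when queries filter on that column, which is the main payoff of step 4.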