Study Notes 2.2.4: Managing Scheduling and Backfills with Postgres in Kestra

Introduction

  • This session covers automating data pipelines with schedules and backfills in Kestra.
  • Previously, a Postgres pipeline was created to:
    • Load taxi data from a CSV file into a staging table.
    • Merge the staged rows into a main table, ensuring no duplicate records (a sketch of this step follows the list).
  • The focus now is on automating that process with scheduled triggers and filling historical data gaps with backfills.
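As a refresher, the merge step can be written as a Postgres task inside the Kestra flow. This is a minimal sketch, not the exact course flow: the connection details, table names (green_trip_data, green_trip_data_staging), and column list are illustrative placeholders, and MERGE requires Postgres 15+.

```yaml
# Minimal sketch of the dedup merge step; names and credentials are placeholders.
- id: merge_into_main
  type: io.kestra.plugin.jdbc.postgresql.Queries
  url: jdbc:postgresql://host.docker.internal:5432/postgres-zoomcamp
  username: kestra
  password: k3str4
  sql: |
    -- Insert only rows whose hash-based unique_row_id is not already present.
    MERGE INTO public.green_trip_data AS t
    USING public.green_trip_data_staging AS s
      ON t.unique_row_id = s.unique_row_id
    WHEN NOT MATCHED THEN
      INSERT (unique_row_id, filename, lpep_pickup_datetime, lpep_dropoff_datetime)
      VALUES (s.unique_row_id, s.filename, s.lpep_pickup_datetime, s.lpep_dropoff_datetime);
```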

Scheduling in Kestra

  • A schedule is implemented using triggers.
  • Instead of manually specifying the month and year, the trigger supplies these values automatically through its execution date.
  • The workflow includes:
    • Two schedule triggers set to different times (9 AM and 10 AM on the 1st of each month).
    • Configuration using a cron expression.
    • Automatic data retrieval each month as new New York taxi data becomes available (a minimal flow sketch follows this list).
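A minimal sketch of such a flow, assuming a recent Kestra version. The flow id, namespace, and file-name pattern are illustrative; the point is that {{ trigger.date }} replaces the manual month/year inputs:

```yaml
id: postgres_taxi_scheduled
namespace: zoomcamp

inputs:
  - id: taxi
    type: SELECT
    values: [green, yellow]
    defaults: green

variables:
  # The month is derived from the trigger's logical date, not typed in by hand.
  file: "{{ inputs.taxi }}_tripdata_{{ trigger.date | date('yyyy-MM') }}.csv"

tasks:
  - id: log_file
    type: io.kestra.plugin.core.log.Log
    message: "Processing {{ inputs.taxi }} data for {{ trigger.date | date('yyyy-MM') }}"

triggers:
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"    # 09:00 on the 1st of every month
    inputs:
      taxi: green
  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"   # 10:00 on the 1st of every month
    inputs:
      taxi: yellow
```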

Understanding Backfills

  • Problem: the available data covers 2019 and 2020, but the pipeline's schedule only starts running from 2025 onwards.
  • Solution: use backfills to execute the past-due scheduled runs and fill in the missing data.
  • Backfills run the pipeline for past dates by selecting a start date (e.g., Jan 1, 2019) and an end date (e.g., Dec 31, 2019).
  • Important considerations:
    • Backfill executions follow the scheduled execution time (9 AM on the 1st of each month).
    • Setting a backfill label helps distinguish historical runs from real-time executions (see the example after this list).
    • Execution logs show which dataset (e.g., Green Taxi Data) is being backfilled.
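Backfills are started from the Kestra UI (the flow's Triggers tab, then "Backfill executions"). Roughly, the form values for the 2019 range would look like the following; this is a hypothetical YAML rendering of the UI form for illustration, not a real API payload:

```yaml
# Hypothetical rendering of the backfill form, for illustration only.
trigger: green_schedule
start: 2019-01-01T09:00:00Z   # first past-due occurrence to execute
end:   2019-12-31T09:00:00Z   # last occurrence to execute
labels:
  backfill: "true"            # label that marks historical runs
```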

Managing Concurrency and Staging Tables

  • By default, a single staging table is used, which limits execution to one month at a time.
  • To run multiple backfills simultaneously, consider (see the sketches after this list):
    • Creating a separate staging table for each month.
    • Using the trigger date to generate dynamic table names.
    • Alternatively, setting a concurrency limit of one so executions queue up instead of conflicting.
  • Potential issues:
    • Truncation conflicts: if multiple executions truncate and load the same staging table, data can be lost or corrupted.
    • Large number of tables: long backfill ranges create many monthly staging tables, which may need to be dropped afterward.
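Two hedged sketches of those options. The flow-level concurrency block is a standard Kestra feature; the per-month staging-table variable is an illustrative pattern built from the trigger date:

```yaml
# Option A: queue executions so no two runs touch the staging table at once.
concurrency:
  behavior: QUEUE
  limit: 1

# Option B: give each month its own staging table so backfills can overlap.
# The table-name pattern is illustrative; drop these tables once the backfill is done.
variables:
  staging_table: "public.green_trip_data_staging_{{ trigger.date | date('yyyy_MM') }}"
```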

Verifying Backfill Execution

  • After execution, verify data integrity by:
    • Checking pgAdmin for the new tables (e.g., green_trip_data and green_trip_data_staging).
    • Ensuring all 12 months of data were processed correctly.
    • Validating the row count (e.g., roughly 6 million rows for 2019); a query sketch follows this list.
    • Confirming every row has a unique row ID, even though some vendor IDs are missing.
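The row-count check can also live in the flow itself. A hedged sketch using Kestra's Postgres plugin, assuming the table and column names above (connection details are placeholders):

```yaml
- id: verify_2019_row_count
  type: io.kestra.plugin.jdbc.postgresql.Query
  url: jdbc:postgresql://host.docker.internal:5432/postgres-zoomcamp
  username: kestra
  password: k3str4
  fetchType: FETCH_ONE
  sql: |
    SELECT count(*)                      AS total_rows,
           count(DISTINCT unique_row_id) AS distinct_ids
    FROM public.green_trip_data
    WHERE lpep_pickup_datetime >= '2019-01-01'
      AND lpep_pickup_datetime <  '2020-01-01';
```

If deduplication worked, total_rows should equal distinct_ids and land around 6 million for 2019.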

Future Enhancements

  • Automating pipelines further:
    • Using schedules to process new data as it arrives.
    • Running backfills dynamically to fill in historical data.
  • Integrating dbt (data build tool) with Postgres to improve workflow automation.
  • Upcoming lessons will focus on enhancing the efficiency and scalability of these automated workflows.

Conclusion

  • Kestra scheduling and backfills help automate data pipelines effectively.
  • Triggers allow data ingestion without manual intervention.
  • Backfills help populate missing historical data.
  • Proper concurrency management prevents conflicts in data processing.
  • Future improvements will involve DBT integration and more robust automation techniques.
