Study Notes 2.2.4: Managing Scheduling and Backfills with Postgres in Kestra

Introduction

  • This session covers automating data pipelines with schedules and backfills in Kestra.
  • Previously, a Postgres pipeline was created to:
    • Load taxi data from a CSV file into a staging table.
    • Merge the staged rows into a main table, ensuring no duplicate records (a sketch of this step follows the list).
  • The focus now is on automating that process with scheduled triggers and filling historical data gaps with backfills.
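As a refresher, the merge step can be written as a Postgres task inside the Kestra flow. This is a minimal sketch, not the exact course flow: the connection details, table names (green_trip_data, green_trip_data_staging), and column list are illustrative placeholders, and MERGE requires Postgres 15+.

```yaml
# Minimal sketch of the dedup merge step; names and credentials are placeholders.
- id: merge_into_main
  type: io.kestra.plugin.jdbc.postgresql.Queries
  url: jdbc:postgresql://host.docker.internal:5432/postgres-zoomcamp
  username: kestra
  password: k3str4
  sql: |
    -- Insert only rows whose hash-based unique_row_id is not already present.
    MERGE INTO public.green_trip_data AS t
    USING public.green_trip_data_staging AS s
      ON t.unique_row_id = s.unique_row_id
    WHEN NOT MATCHED THEN
      INSERT (unique_row_id, filename, lpep_pickup_datetime, lpep_dropoff_datetime)
      VALUES (s.unique_row_id, s.filename, s.lpep_pickup_datetime, s.lpep_dropoff_datetime);
```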

Scheduling in Kestra

  • A schedule is implemented using triggers.
  • Instead of manually specifying the month and year, the trigger supplies these values automatically through its execution date.
  • The workflow includes:
    • Two schedule triggers set to different times (9 AM and 10 AM on the 1st of each month).
    • Configuration using a cron expression.
    • Automatic data retrieval each month as new New York taxi data becomes available (a minimal flow sketch follows this list).
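A minimal sketch of such a flow, assuming a recent Kestra version. The flow id, namespace, and file-name pattern are illustrative; the point is that {{ trigger.date }} replaces the manual month/year inputs:

```yaml
id: postgres_taxi_scheduled
namespace: zoomcamp

inputs:
  - id: taxi
    type: SELECT
    values: [green, yellow]
    defaults: green

variables:
  # The month is derived from the trigger's logical date, not typed in by hand.
  file: "{{ inputs.taxi }}_tripdata_{{ trigger.date | date('yyyy-MM') }}.csv"

tasks:
  - id: log_file
    type: io.kestra.plugin.core.log.Log
    message: "Processing {{ inputs.taxi }} data for {{ trigger.date | date('yyyy-MM') }}"

triggers:
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"    # 09:00 on the 1st of every month
    inputs:
      taxi: green
  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"   # 10:00 on the 1st of every month
    inputs:
      taxi: yellow
```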

Understanding Backfills

  • Problem: the available data covers 2019 and 2020, but the pipeline's schedule only starts running from 2025 onwards.
  • Solution: use backfills to execute the past-due scheduled runs and fill in the missing data.
  • Backfills run the pipeline for past dates by selecting a start date (e.g., Jan 1, 2019) and an end date (e.g., Dec 31, 2019).
  • Important considerations:
    • Backfill executions follow the scheduled execution time (9 AM on the 1st of each month).
    • Setting a backfill label helps distinguish historical runs from real-time executions (see the example after this list).
    • Execution logs show which dataset (e.g., Green Taxi Data) is being backfilled.
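Backfills are started from the Kestra UI (the flow's Triggers tab, then "Backfill executions"). Roughly, the form values for the 2019 range would look like the following; this is a hypothetical YAML rendering of the UI form for illustration, not a real API payload:

```yaml
# Hypothetical rendering of the backfill form, for illustration only.
trigger: green_schedule
start: 2019-01-01T09:00:00Z   # first past-due occurrence to execute
end:   2019-12-31T09:00:00Z   # last occurrence to execute
labels:
  backfill: "true"            # label that marks historical runs
```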

Managing Concurrency and Staging Tables

  • By default, a single staging table is used, which limits execution to one month at a time.
  • To run multiple backfills simultaneously, consider (see the sketches after this list):
    • Creating a separate staging table for each month.
    • Using the trigger date to generate dynamic table names.
    • Alternatively, setting a concurrency limit of one so executions queue up instead of conflicting.
  • Potential issues:
    • Truncation conflicts: if multiple executions truncate and load the same staging table, data can be lost or corrupted.
    • Large number of tables: long backfill ranges create many monthly staging tables, which may need to be dropped afterward.
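Two hedged sketches of those options. The flow-level concurrency block is a standard Kestra feature; the per-month staging-table variable is an illustrative pattern built from the trigger date:

```yaml
# Option A: queue executions so no two runs touch the staging table at once.
concurrency:
  behavior: QUEUE
  limit: 1

# Option B: give each month its own staging table so backfills can overlap.
# The table-name pattern is illustrative; drop these tables once the backfill is done.
variables:
  staging_table: "public.green_trip_data_staging_{{ trigger.date | date('yyyy_MM') }}"
```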

Verifying Backfill Execution

  • After execution, verify data integrity by:
    • Checking pgAdmin for the new tables (e.g., green_trip_data and green_trip_data_staging).
    • Ensuring all 12 months of data were processed correctly.
    • Validating the row count (e.g., roughly 6 million rows for 2019); a query sketch follows this list.
    • Confirming every row has a unique row ID, even though some vendor IDs are missing.
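The row-count check can also live in the flow itself. A hedged sketch using Kestra's Postgres plugin, assuming the table and column names above (connection details are placeholders):

```yaml
- id: verify_2019_row_count
  type: io.kestra.plugin.jdbc.postgresql.Query
  url: jdbc:postgresql://host.docker.internal:5432/postgres-zoomcamp
  username: kestra
  password: k3str4
  fetchType: FETCH_ONE
  sql: |
    SELECT count(*)                      AS total_rows,
           count(DISTINCT unique_row_id) AS distinct_ids
    FROM public.green_trip_data
    WHERE lpep_pickup_datetime >= '2019-01-01'
      AND lpep_pickup_datetime <  '2020-01-01';
```

If deduplication worked, total_rows should equal distinct_ids and land around 6 million for 2019.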

Future Enhancements

  • Automating pipelines further:
    • Using schedules to process new data as it arrives.
    • Running backfills dynamically to fill in historical data.
  • Integrating dbt (data build tool) with Postgres to improve workflow automation.
  • Upcoming lessons will focus on enhancing the efficiency and scalability of these automated workflows.

Conclusion

  • Kestra scheduling and backfills help automate data pipelines effectively.
  • Triggers allow data ingestion without manual intervention.
  • Backfills help populate missing historical data.
  • Proper concurrency management prevents conflicts in data processing.
  • Future improvements will involve DBT integration and more robust automation techniques.
