DEV Community

Chen Debra
Part 5 | What Happens When Tasks Fail? A Complete Guide to Retry and Backfill in Apache DolphinScheduler

This article is the fifth installment of the series “Understanding Apache DolphinScheduler: From Scheduling Principles to DataOps Practices.” Using Apache DolphinScheduler as an example, it explains failure retry, manual rerun, and backfill mechanisms in scheduling systems, clarifies the meaning of Exactly Once semantics in scheduling, and summarizes common misuse scenarios and best practices to help build a stable and reliable data scheduling system.

In the daily operation of data platforms, task failures are almost inevitable. Network fluctuations, insufficient resources, downstream dependency failures, and code bugs can all cause scheduled tasks to fail. When failures occur, many teams rely on automatic retries, manual reruns, or backfill operations to recover the data pipeline.
However, an often overlooked fact is:

Failure retry, manual rerun, and backfill in scheduling systems actually have completely different semantics.
If these differences are not clearly understood, it can easily lead to duplicate data, data misalignment, or even data corruption. This article analyzes the design mechanisms of Apache DolphinScheduler to explain three of the most common but frequently misunderstood capabilities in scheduling systems: failure retry, manual rerun, and backfill, and further explores the real meaning of “Exactly Once” in scheduling systems.

1 Failure Retry vs Manual Rerun: Two Completely Different Recovery Mechanisms

In scheduling systems, failed tasks are usually recovered in two ways:

  1. Automatic Retry
  2. Manual Rerun

Many people assume the only difference between them is how they are triggered. In reality, they are fundamentally different in terms of execution semantics.

1 Automatic Retry: Re-execution within the Same Instance

In Apache DolphinScheduler, every schedule generates a Workflow Instance, which contains multiple Task Instances.
When a task fails and Retry Times is configured, the system automatically retries the task within the same task instance.
Its characteristics include:

  • Belongs to the same workflow instance
  • Keeps the same Schedule Time
  • Dependency relationships remain unchanged
  • Only the failed task is re-executed

The design goal of automatic retry is to handle:

Transient failures
For example:

  • Network fluctuations
  • Temporary resource shortages
  • Short-term unavailability of external systems

In such cases, automatic retry can usually restore the task quickly.
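The retry behavior can be sketched as a simple loop. This is an illustrative model only, not DolphinScheduler's actual worker code; `run_with_retry` and its parameters are hypothetical names:

```python
import time

def run_with_retry(task, max_retries=3, retry_interval_s=60):
    """Re-execute a failing task within the same task instance.

    The instance identity and schedule time never change across attempts;
    only the attempt counter advances. (Simplified illustration, not
    DolphinScheduler's real implementation.)
    """
    attempt = 0
    while True:
        try:
            return task()                 # same task instance, same schedule time
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise                     # mark the task instance as FAILED
            time.sleep(retry_interval_s)  # wait before the next attempt
```

A task that fails twice with a transient error and then succeeds would recover on the third attempt without any new workflow instance being created.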

2 Manual Rerun: Creating a New Instance

Unlike automatic retry, a manual rerun creates a new instance.
In Apache DolphinScheduler, users can choose to:

  • Rerun failed nodes
  • Rerun from the current node
  • Rerun the entire workflow from the beginning

In these scenarios, the system generates a new Workflow Instance.

This means two instances may process data for the same logical time, and downstream tasks may write data repeatedly.

If tasks are not idempotent, this may lead to duplicate data issues.
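The duplicate-write risk is easy to reproduce. In this sketch, an in-memory SQLite table stands in for a warehouse table (the `sales` table and its columns are purely illustrative), and the same logical write runs twice, as it would when a manual rerun follows the original run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (dt TEXT, amount INTEGER)")

def write_partition(dt):
    # Non-idempotent: plain INSERT INTO appends rows on every run
    conn.execute("INSERT INTO sales VALUES (?, ?)", (dt, 100))

write_partition("2025-03-01")   # original workflow instance
write_partition("2025-03-01")   # manual rerun: new instance, same schedule time

count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE dt = '2025-03-01'"
).fetchone()[0]
print(count)  # 2: the same logical day was written twice
```

The same logical day now holds twice the data, which is exactly the failure mode described above.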

2 Backfill and Data Recovery: Reconstructing Time in Scheduling Systems

In data warehouse scenarios, backfill is a very common operation. For example:

  • Backfilling historical data after creating a new task
  • Rerunning tasks for days when execution failed
  • Filling missing data due to upstream delays

In Apache DolphinScheduler, backfill is typically performed using Backfill Run.

1 The Nature of Backfill: Creating Multiple Historical Instances

Assume a task runs daily.
Backfill range:

```
2025-03-01 → 2025-03-05
```

The system will create multiple instances:

```
Instance (2025-03-01)
Instance (2025-03-02)
Instance (2025-03-03)
Instance (2025-03-04)
Instance (2025-03-05)
```

Each instance has:

  • Independent execution status
  • Independent dependency relationships
  • Independent parameters

The schedule time of each instance is set to the corresponding historical time.
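For a daily task, backfill essentially enumerates one logical schedule time per day in the range and creates an instance for each. A minimal sketch of that enumeration (the function name is illustrative, not DolphinScheduler's API):

```python
from datetime import date, timedelta

def backfill_schedule_times(start, end):
    """Yield one logical schedule time per day, inclusive of both ends."""
    d = start
    while d <= end:
        yield d.isoformat()
        d += timedelta(days=1)

times = list(backfill_schedule_times(date(2025, 3, 1), date(2025, 3, 5)))
print(times)
# ['2025-03-01', '2025-03-02', '2025-03-03', '2025-03-04', '2025-03-05']
```

Each of these five dates becomes the schedule time of one independent instance.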

2 The Key to Backfill: Schedule Time vs Execution Time

In scheduling systems, two concepts are extremely important:

  • Schedule Time: the logical data time
  • Execution Time: the actual time the task runs

Example:

```
Schedule Time : 2025-03-01
Execution Time: 2025-03-10
```

If the SQL uses:

```
WHERE dt = ${schedule_time}
```

backfill is safe. But if the SQL uses:

```
WHERE dt = today()
```

backfill will produce incorrect data, because every backfilled instance reads the day it happens to run on instead of its own historical day.
This is also the root cause of many data quality issues.
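The difference is easy to see in a small sketch. A 2025-03-01 instance is backfilled on 2025-03-10; only the schedule-time filter selects the intended rows (the row data here is made up for illustration):

```python
rows = [
    {"dt": "2025-03-01", "amount": 10},
    {"dt": "2025-03-10", "amount": 99},
]

schedule_time = "2025-03-01"    # logical data time of the backfilled instance
execution_date = "2025-03-10"   # the day the backfill actually runs

# Correct: WHERE dt = ${schedule_time} selects the historical partition
good = [r for r in rows if r["dt"] == schedule_time]

# Wrong: WHERE dt = today() selects whatever day the rerun happens on
bad = [r for r in rows if r["dt"] == execution_date]

print(good)  # [{'dt': '2025-03-01', 'amount': 10}]  the intended data
print(bad)   # [{'dt': '2025-03-10', 'amount': 99}]  wrong partition entirely
```

The backfilled instance with the `today()` filter silently processes the wrong day, which is why such bugs often surface only after a backfill.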

3 Exactly Once in Scheduling Systems: What Does It Really Mean?

In stream processing systems such as Apache Flink, Exactly Once usually means:

Each record is processed only once.
However, in scheduling systems, Exactly Once has a completely different meaning.
A scheduling system cannot guarantee that tasks will not run multiple times, nor can it guarantee that data will not be written repeatedly. This is because automatic retries may re-execute tasks, manual reruns may re-execute tasks, and backfill may rerun historical logic.
Therefore, in scheduling systems, Exactly Once is closer to the idea that:
Only one logical instance is generated for the same schedule time.
But the task itself may still run multiple times.
Thus, true Exactly Once semantics must be guaranteed by idempotent task logic.
Common implementations include:

1 Overwrite Write

```
INSERT OVERWRITE TABLE
```

2 Partition-based Writing

```
partition dt='${schedule_time}'
```

3 Deduplicated Writing

```
MERGE INTO
```
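An overwrite-style write can be emulated as "delete the partition, then insert", which is what makes reruns safe. Below is a sketch using SQLite as a stand-in for `INSERT OVERWRITE` semantics (the `sales` table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (dt TEXT, amount INTEGER)")

def overwrite_partition(dt, amounts):
    """Idempotent write: clear the partition first, then insert.

    Running this twice for the same schedule time leaves exactly one
    copy of the data (emulates INSERT OVERWRITE on one partition).
    """
    with conn:  # single transaction: delete + insert commit together
        conn.execute("DELETE FROM sales WHERE dt = ?", (dt,))
        conn.executemany("INSERT INTO sales VALUES (?, ?)",
                         [(dt, a) for a in amounts])

overwrite_partition("2025-03-01", [100, 200])   # original run
overwrite_partition("2025-03-01", [100, 200])   # rerun / retry / backfill

count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE dt = '2025-03-01'"
).fetchone()[0]
print(count)  # 2 rows, not 4: the rerun overwrote instead of appending
```

Because the delete and insert run in one transaction, a rerun at any point yields the same final partition contents.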

4 Common Misuse Scenarios

Many data incidents actually stem from misunderstandings of scheduling semantics.

1 Using Current Time as Data Date

Incorrect example:

```
dt = today()
```

Correct approach:

```
dt = ${schedule_time}
```

2 Non-idempotent Writes

Example:

```
INSERT INTO table
```

If the task is rerun, duplicate data will occur.

3 Manually Rerunning the Entire Workflow

Many users habitually respond to a failure by rerunning the entire workflow from the beginning. A safer approach is to rerun only the failed nodes, so tasks that already succeeded are not executed again.

5 Best Practice Recommendations

Based on experience using Apache DolphinScheduler, several important practices can be summarized.

1 Tasks Must Be Designed to Be Idempotent

All tasks should tolerate repeated execution without affecting data correctness.

2 Data Logic Must Be Based on Schedule Time

Avoid using now() or today() as the data date. Always use ${schedule_time}.

3 Use Retry Strategies Appropriately

Recommended configuration:

```
Retry Times   : 1-3
Retry Interval: 1-5 min
```

Avoid infinite retries.

4 Control Concurrency During Backfill

If the backfill range is too large, a large number of instances may be generated at once, which can cause:

  • scheduling queue congestion
  • cluster resource exhaustion

Recommendation: perform backfill in batches.
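Batching can be sketched as splitting the date range into small consecutive windows and submitting them one window at a time. The function below is an illustrative helper, not a DolphinScheduler API:

```python
from datetime import date, timedelta

def batches(start, end, batch_days=7):
    """Split a backfill range into consecutive windows so that only one
    batch's worth of instances is queued at a time."""
    cur = start
    while cur <= end:
        batch_end = min(cur + timedelta(days=batch_days - 1), end)
        yield (cur.isoformat(), batch_end.isoformat())
        cur = batch_end + timedelta(days=1)

windows = list(batches(date(2025, 3, 1), date(2025, 3, 20), batch_days=7))
print(windows)
# [('2025-03-01', '2025-03-07'), ('2025-03-08', '2025-03-14'),
#  ('2025-03-15', '2025-03-20')]
```

Submitting one window, waiting for it to finish, and then submitting the next keeps the number of concurrent instances bounded.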

Conclusion

In data platforms, scheduling systems are often regarded as simple “task triggers.” In reality, they are responsible for time management, dependency control, and failure recovery.
By understanding the true semantics of failure retry, manual rerun, and backfill, we can build stable and reliable data production systems.
Modern scheduling systems, such as Apache DolphinScheduler, already provide powerful mechanisms. However, the ultimate factor determining data quality is still:

Correct understanding of scheduling semantics + idempotent data task design.
Only in this way can data platforms remain recoverable, traceable, and reconstructable even when failures occur.
