DEV Community

Chen Debra
Part 5 | What Happens When Tasks Fail? A Complete Guide to Retry and Backfill in Apache DolphinScheduler

This article is the fifth installment of the series “Understanding Apache DolphinScheduler: From Scheduling Principles to DataOps Practices.” Using Apache DolphinScheduler as an example, it explains failure retry, manual rerun, and backfill mechanisms in scheduling systems, clarifies the meaning of Exactly Once semantics in scheduling, and summarizes common misuse scenarios and best practices to help build a stable and reliable data scheduling system.

In the daily operation of data platforms, task failures are almost inevitable. Network fluctuations, insufficient resources, downstream dependency failures, and code bugs can all cause scheduled tasks to fail. When failures occur, many teams rely on automatic retries, manual reruns, or backfill operations to recover the data pipeline.
However, an often overlooked fact is:

Failure retry, manual rerun, and backfill in scheduling systems actually have completely different semantics.
If these differences are not clearly understood, it can easily lead to duplicate data, data misalignment, or even data corruption. This article analyzes the design mechanisms of Apache DolphinScheduler to explain three of the most common but frequently misunderstood capabilities in scheduling systems: failure retry, manual rerun, and backfill, and further explores the real meaning of “Exactly Once” in scheduling systems.

1 Failure Retry vs Manual Rerun: Two Completely Different Recovery Mechanisms

In scheduling systems, failed tasks are usually recovered in two ways:

  1. Automatic Retry
  2. Manual Rerun

Many people assume the only difference between them is how they are triggered. In reality, they are fundamentally different in terms of execution semantics.

1 Automatic Retry: Re-execution within the Same Instance

In Apache DolphinScheduler, every schedule generates a Workflow Instance, which contains multiple Task Instances.
When a task fails and Retry Times is configured, the system automatically retries the task within the same task instance.
Its characteristics include:

  • Belongs to the same workflow instance
  • Keeps the same Schedule Time
  • Dependency relationships remain unchanged
  • Only the failed task is re-executed

The design goal of automatic retry is to handle:

Transient failures
For example:

  • Network fluctuations
  • Temporary resource shortages
  • Short-term unavailability of external systems

In such cases, automatic retry can usually restore the task quickly.
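The retry behavior can be sketched as a simple loop. This is an illustrative model only, not DolphinScheduler's actual worker code; `run_with_retry` and its parameters are hypothetical names:

```python
import time

def run_with_retry(task, max_retries=3, retry_interval_s=60):
    """Re-execute a failing task within the same task instance.

    The instance identity and schedule time never change across attempts;
    only the attempt counter advances. (Simplified illustration, not
    DolphinScheduler's real implementation.)
    """
    attempt = 0
    while True:
        try:
            return task()                 # same task instance, same schedule time
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise                     # mark the task instance as FAILED
            time.sleep(retry_interval_s)  # wait before the next attempt
```

A task that fails twice with a transient error and then succeeds would recover on the third attempt without any new workflow instance being created.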

2 Manual Rerun: Creating a New Instance

Unlike automatic retry, a manual rerun creates a new instance.
In Apache DolphinScheduler, users can choose to:

  • Rerun failed nodes
  • Rerun from the current node
  • Rerun the entire workflow from the beginning

In these scenarios, the system generates a new Workflow Instance.

This means two instances may process data for the same logical time, and downstream tasks may write data repeatedly.

If tasks are not idempotent, this may lead to duplicate data issues.
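The duplicate-write risk is easy to reproduce. In this sketch, an in-memory SQLite table stands in for a warehouse table (the `sales` table and its columns are purely illustrative), and the same logical write runs twice, as it would when a manual rerun follows the original run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (dt TEXT, amount INTEGER)")

def write_partition(dt):
    # Non-idempotent: plain INSERT INTO appends rows on every run
    conn.execute("INSERT INTO sales VALUES (?, ?)", (dt, 100))

write_partition("2025-03-01")   # original workflow instance
write_partition("2025-03-01")   # manual rerun: new instance, same schedule time

count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE dt = '2025-03-01'"
).fetchone()[0]
print(count)  # 2: the same logical day was written twice
```

The same logical day now holds twice the data, which is exactly the failure mode described above.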

2 Backfill and Data Recovery: Reconstructing Time in Scheduling Systems

In data warehouse scenarios, backfill is a very common operation. For example:

  • Backfilling historical data after creating a new task
  • Rerunning tasks for days when execution failed
  • Filling missing data due to upstream delays

In Apache DolphinScheduler, backfill is typically performed using Backfill Run.

1 The Nature of Backfill: Creating Multiple Historical Instances

Assume a task runs daily.
Backfill range:

```
2025-03-01 → 2025-03-05
```

The system will create multiple instances:

```
Instance (2025-03-01)
Instance (2025-03-02)
Instance (2025-03-03)
Instance (2025-03-04)
Instance (2025-03-05)
```

Each instance has:

  • Independent execution status
  • Independent dependency relationships
  • Independent parameters

The schedule time of each instance is set to the corresponding historical time.
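For a daily task, backfill essentially enumerates one logical schedule time per day in the range and creates an instance for each. A minimal sketch of that enumeration (the function name is illustrative, not DolphinScheduler's API):

```python
from datetime import date, timedelta

def backfill_schedule_times(start, end):
    """Yield one logical schedule time per day, inclusive of both ends."""
    d = start
    while d <= end:
        yield d.isoformat()
        d += timedelta(days=1)

times = list(backfill_schedule_times(date(2025, 3, 1), date(2025, 3, 5)))
print(times)
# ['2025-03-01', '2025-03-02', '2025-03-03', '2025-03-04', '2025-03-05']
```

Each of these five dates becomes the schedule time of one independent instance.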

2 The Key to Backfill: Schedule Time vs Execution Time

In scheduling systems, two concepts are extremely important:

  • Schedule Time: the logical data time
  • Execution Time: the actual time the task runs

Example:

```
Schedule Time : 2025-03-01
Execution Time: 2025-03-10
```

If the SQL uses:

```
WHERE dt = ${schedule_time}
```

backfill is safe. But if the SQL uses:

```
WHERE dt = today()
```

backfill will produce incorrect data, because every backfilled instance reads the day it happens to run on instead of its own historical day.
This is also the root cause of many data quality issues.
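The difference is easy to see in a small sketch. A 2025-03-01 instance is backfilled on 2025-03-10; only the schedule-time filter selects the intended rows (the row data here is made up for illustration):

```python
rows = [
    {"dt": "2025-03-01", "amount": 10},
    {"dt": "2025-03-10", "amount": 99},
]

schedule_time = "2025-03-01"    # logical data time of the backfilled instance
execution_date = "2025-03-10"   # the day the backfill actually runs

# Correct: WHERE dt = ${schedule_time} selects the historical partition
good = [r for r in rows if r["dt"] == schedule_time]

# Wrong: WHERE dt = today() selects whatever day the rerun happens on
bad = [r for r in rows if r["dt"] == execution_date]

print(good)  # [{'dt': '2025-03-01', 'amount': 10}]  the intended data
print(bad)   # [{'dt': '2025-03-10', 'amount': 99}]  wrong partition entirely
```

The backfilled instance with the `today()` filter silently processes the wrong day, which is why such bugs often surface only after a backfill.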

3 Exactly Once in Scheduling Systems: What Does It Really Mean?

In stream processing systems such as Apache Flink, Exactly Once usually means:

Each record is processed only once.
However, in scheduling systems, Exactly Once has a completely different meaning.
A scheduling system cannot guarantee that tasks will not run multiple times, nor can it guarantee that data will not be written repeatedly. This is because automatic retries may re-execute tasks, manual reruns may re-execute tasks, and backfill may rerun historical logic.
Therefore, in scheduling systems, Exactly Once is closer to the idea that:
Only one logical instance is generated for the same schedule time.
But the task itself may still run multiple times.
Thus, true Exactly Once semantics must be guaranteed by idempotent task logic.
Common implementations include:

1 Overwrite Write

```
INSERT OVERWRITE TABLE
```

2 Partition-based Writing

```
partition dt='${schedule_time}'
```

3 Deduplicated Writing

```
MERGE INTO
```
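An overwrite-style write can be emulated as "delete the partition, then insert", which is what makes reruns safe. Below is a sketch using SQLite as a stand-in for `INSERT OVERWRITE` semantics (the `sales` table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (dt TEXT, amount INTEGER)")

def overwrite_partition(dt, amounts):
    """Idempotent write: clear the partition first, then insert.

    Running this twice for the same schedule time leaves exactly one
    copy of the data (emulates INSERT OVERWRITE on one partition).
    """
    with conn:  # single transaction: delete + insert commit together
        conn.execute("DELETE FROM sales WHERE dt = ?", (dt,))
        conn.executemany("INSERT INTO sales VALUES (?, ?)",
                         [(dt, a) for a in amounts])

overwrite_partition("2025-03-01", [100, 200])   # original run
overwrite_partition("2025-03-01", [100, 200])   # rerun / retry / backfill

count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE dt = '2025-03-01'"
).fetchone()[0]
print(count)  # 2 rows, not 4: the rerun overwrote instead of appending
```

Because the delete and insert run in one transaction, a rerun at any point yields the same final partition contents.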

4 Common Misuse Scenarios

Many data incidents actually stem from misunderstandings of scheduling semantics.

1 Using Current Time as Data Date

Incorrect example:

```
dt = today()
```

Correct approach:

```
dt = ${schedule_time}
```

2 Non-idempotent Writes

Example:

```
INSERT INTO table
```

If the task is rerun, duplicate data will occur.

3 Manually Rerunning the Entire Workflow

Many users habitually respond to a failure by rerunning the entire workflow from the beginning. A safer approach is to rerun only the failed nodes, so tasks that already succeeded are not executed again.

5 Best Practice Recommendations

Based on experience using Apache DolphinScheduler, several important practices can be summarized.

1 Tasks Must Be Designed to Be Idempotent

All tasks should tolerate repeated execution without affecting data correctness.

2 Data Logic Must Be Based on Schedule Time

Avoid using now() or today() as the data date. Always use ${schedule_time}.

3 Use Retry Strategies Appropriately

Recommended configuration:

```
Retry Times   : 1-3
Retry Interval: 1-5 min
```

Avoid infinite retries.

4 Control Concurrency During Backfill

If the backfill range is too large, a large number of instances may be generated at once, which can cause:

  • scheduling queue congestion
  • cluster resource exhaustion

Recommendation: perform backfill in batches.
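Batching can be sketched as splitting the date range into small consecutive windows and submitting them one window at a time. The function below is an illustrative helper, not a DolphinScheduler API:

```python
from datetime import date, timedelta

def batches(start, end, batch_days=7):
    """Split a backfill range into consecutive windows so that only one
    batch's worth of instances is queued at a time."""
    cur = start
    while cur <= end:
        batch_end = min(cur + timedelta(days=batch_days - 1), end)
        yield (cur.isoformat(), batch_end.isoformat())
        cur = batch_end + timedelta(days=1)

windows = list(batches(date(2025, 3, 1), date(2025, 3, 20), batch_days=7))
print(windows)
# [('2025-03-01', '2025-03-07'), ('2025-03-08', '2025-03-14'),
#  ('2025-03-15', '2025-03-20')]
```

Submitting one window, waiting for it to finish, and then submitting the next keeps the number of concurrent instances bounded.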

Conclusion

In data platforms, scheduling systems are often regarded as simple “task triggers.” In reality, they are responsible for time management, dependency control, and failure recovery.
By understanding the true semantics of failure retry, manual rerun, and backfill, we can build stable and reliable data production systems.
Modern scheduling systems, such as Apache DolphinScheduler, already provide powerful mechanisms. However, the ultimate factor determining data quality is still:

Correct understanding of scheduling semantics + idempotent data task design.
Only in this way can data platforms remain recoverable, traceable, and reconstructable even when failures occur.
