This article is the fifth installment of the series “Understanding Apache DolphinScheduler: From Scheduling Principles to DataOps Practices.” Using Apache DolphinScheduler as an example, it explains failure retry, manual rerun, and backfill mechanisms in scheduling systems, clarifies the meaning of Exactly Once semantics in scheduling, and summarizes common misuse scenarios and best practices to help build a stable and reliable data scheduling system.
In the daily operation of data platforms, task failures are almost inevitable. Network fluctuations, insufficient resources, downstream dependency failures, and code bugs can all cause scheduled tasks to fail. When failures occur, many teams rely on automatic retries, manual reruns, or backfill operations to recover the data pipeline.
However, an often overlooked fact is:
Failure retry, manual rerun, and backfill in scheduling systems actually have completely different semantics.
If these differences are not clearly understood, it can easily lead to duplicate data, data misalignment, or even data corruption. This article analyzes the design mechanisms of Apache DolphinScheduler to explain three of the most common but frequently misunderstood capabilities in scheduling systems: failure retry, manual rerun, and backfill, and further explores the real meaning of “Exactly Once” in scheduling systems.
1 Failure Retry vs Manual Rerun: Two Completely Different Recovery Mechanisms
In scheduling systems, failed tasks are usually recovered in two ways:
- Automatic Retry
- Manual Rerun

Many people assume the only difference between them is how they are triggered. In reality, they differ fundamentally in execution semantics.
1 Automatic Retry: Re-execution within the Same Instance
In Apache DolphinScheduler, every schedule generates a Workflow Instance, which contains multiple Task Instances.
When a task fails and Retry Times is configured, the system automatically retries the task within the same task instance.
Its characteristics include:
- Belongs to the same workflow instance
- Keeps the same Schedule Time
- Dependency relationships remain unchanged
- Only the failed task is re-executed
The design goal of automatic retry is to handle:
Transient failures
For example:
- Network fluctuations
- Temporary resource shortages
- Short-term unavailability of external systems

In such cases, automatic retry can usually restore the task quickly.
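The retry semantics described above can be sketched as a small loop: every attempt runs inside the same task instance and therefore sees the same schedule time. This is a minimal illustration, not DolphinScheduler's actual implementation; `run_with_retry`, `flaky_task`, and the exception type are assumed names.

```python
def run_with_retry(task, schedule_time, max_retries=3):
    """Re-run `task` up to `max_retries` extra times; schedule_time never changes."""
    attempt = 0
    while True:
        try:
            return task(schedule_time)  # same logical time on every attempt
        except RuntimeError:
            attempt += 1
            if attempt > max_retries:
                raise  # retries exhausted: the instance is marked as failed

# A task that fails twice with a transient error, then succeeds.
calls = []
def flaky_task(schedule_time):
    calls.append(schedule_time)
    if len(calls) < 3:
        raise RuntimeError("transient network error")
    return f"done for {schedule_time}"

result = run_with_retry(flaky_task, "2025-03-01")
print(result)  # all three attempts shared schedule time 2025-03-01
```

Note that the retry loop never recomputes the schedule time; that is exactly what distinguishes automatic retry from a new instance.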
2 Manual Rerun: Creating a New Instance
Unlike automatic retry, a manual rerun creates a new instance.
In Apache DolphinScheduler, users can choose to:
- Rerun failed nodes
- Rerun from the current node
- Rerun the entire workflow from the beginning

In these scenarios, the system generates a new Workflow Instance.
This means two instances may process data for the same logical time, and downstream tasks may write data repeatedly.
If tasks are not idempotent, this may lead to duplicate data issues.
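The duplicate-data risk can be shown in a few lines: if the write target is append-only, the rerun instance adds a second copy of the same logical day. The in-memory `table` list is an illustrative stand-in for any non-idempotent sink.

```python
table = []  # append-only sink, i.e. a non-idempotent write target

def task_append(schedule_time):
    # Each run appends rows for its logical day, with no dedup or overwrite.
    table.append({"dt": schedule_time, "rows": 100})

task_append("2025-03-01")  # original workflow instance
task_append("2025-03-01")  # manual rerun: a NEW instance, SAME schedule time

print(len(table))  # 2 -> dt=2025-03-01 now exists twice
```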
2 Backfill and Data Recovery: Reconstructing Time in Scheduling Systems
In data warehouse scenarios, backfill is a very common operation. For example:
- Backfilling historical data after creating a new task
- Rerunning tasks for days when execution failed
- Filling missing data due to upstream delays

In Apache DolphinScheduler, backfill is typically performed using Backfill Run.
1 The Nature of Backfill: Creating Multiple Historical Instances
Assume a task runs daily.
Backfill range:
2025-03-01 → 2025-03-05
The system will create multiple instances:
Instance (2025-03-01)
Instance (2025-03-02)
Instance (2025-03-03)
Instance (2025-03-04)
Instance (2025-03-05)
Each instance has:
- Independent execution status
- Independent dependency relationships
- Independent parameters

The schedule time of each instance is set to the corresponding historical time.
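Conceptually, a backfill run materializes one independent instance per schedule time in the requested range, which can be sketched as follows. The dict shape and state name are illustrative, not DolphinScheduler's internal model.

```python
from datetime import date, timedelta

def backfill_instances(start: date, end: date):
    """One independent instance per day in [start, end], each with its own schedule time."""
    instances, day = [], start
    while day <= end:
        instances.append({
            "schedule_time": day.isoformat(),  # logical data time, not today
            "state": "SUBMITTED",              # independent execution status
        })
        day += timedelta(days=1)
    return instances

runs = backfill_instances(date(2025, 3, 1), date(2025, 3, 5))
print([r["schedule_time"] for r in runs])
# ['2025-03-01', '2025-03-02', '2025-03-03', '2025-03-04', '2025-03-05']
```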
2 The Key to Backfill: Schedule Time vs Execution Time
In scheduling systems, two concepts are critically important:

- Schedule Time: the logical data time
- Execution Time: the actual time the task runs
Example:
Schedule Time : 2025-03-01
Execution Time: 2025-03-10
If the SQL uses:
WHERE dt = ${schedule_time}
Backfill is safe.
But if the SQL uses:
WHERE dt = today()
Backfill will produce incorrect data.
This is also the root cause of many data quality issues.
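The contrast between the two SQL patterns above can be made concrete with a small sketch. `render_sql` is an illustrative stand-in for the scheduler's parameter substitution, not a DolphinScheduler API.

```python
from datetime import date

def render_sql(template: str, schedule_time: str) -> str:
    # Substitute the logical data time into the SQL template.
    return template.replace("${schedule_time}", schedule_time)

schedule_time = "2025-03-01"                # logical day being backfilled
execution_day = date.today().isoformat()    # actual day the backfill runs

# Safe: the filter is pinned to the schedule time, whenever the task runs.
safe = render_sql("SELECT * FROM t WHERE dt = '${schedule_time}'", schedule_time)

# Unsafe: today() resolves to the execution day, so a backfill on 2025-03-10
# would read/write the wrong partition.
unsafe = f"SELECT * FROM t WHERE dt = '{execution_day}'"

print(safe)
print(unsafe)
```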
3 Exactly Once in Scheduling Systems: What Does It Really Mean?
In stream processing systems such as Apache Flink, Exactly Once usually means:
Each record is processed only once.
However, in scheduling systems, Exactly Once has a completely different meaning.
A scheduling system cannot guarantee that tasks will not run multiple times, nor can it guarantee that data will not be written repeatedly. This is because automatic retries may re-execute tasks, manual reruns may re-execute tasks, and backfill may rerun historical logic.
Therefore, in scheduling systems, Exactly Once is closer to the idea that:
Only one logical instance is generated for the same schedule time.
But the task itself may still run multiple times.
Thus, true Exactly Once semantics must be guaranteed by idempotent task logic.
Common implementations include:
1 Overwrite Write
INSERT OVERWRITE TABLE
2 Partition-based Writing
partition dt='${schedule_time}'
3 Deduplicated Writing
MERGE INTO
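All three patterns share one property: running the task again for the same schedule time leaves the table in the same final state. A minimal sketch of the partition-overwrite variant, with a dict keyed by partition standing in for a partitioned warehouse table:

```python
table = {}  # partition key dt -> rows, mimicking a partitioned table

def write_partition(schedule_time, rows):
    # Overwrite semantics: the partition for this schedule time is replaced,
    # so reruns and retries converge to the same final state (idempotent).
    table[schedule_time] = rows

write_partition("2025-03-01", [1, 2, 3])
write_partition("2025-03-01", [1, 2, 3])  # rerun of the same logical day

print(table)  # {'2025-03-01': [1, 2, 3]} -- no duplication
```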
4 Common Misuse Scenarios
Many data incidents actually stem from misunderstandings of scheduling semantics.
1 Using Current Time as Data Date
Incorrect example:
dt = today()
Correct approach:
dt = ${schedule_time}
2 Non-idempotent Writes
Example:
INSERT INTO table
If the task is rerun:
duplicate data will occur
3 Manually Rerunning the Entire Workflow
Many users habitually do:
Failure → rerun from the beginning
But a safer approach is:
rerun only the failed nodes
5 Best Practice Recommendations
Based on experience using Apache DolphinScheduler, several important practices can be summarized.
1 Tasks Must Be Designed to Be Idempotent
All tasks should allow repeated execution without affecting data correctness.
2 Data Logic Must Be Based on Schedule Time
Avoid using:
now()
today()
Always use:
${schedule_time}
3 Use Retry Strategies Appropriately
Recommended configuration:
Retry Times: 1~3
Retry Interval: 1~5 min
Avoid infinite retries.
4 Control Concurrency During Backfill
If the backfill range is too large:
a large number of instances may be generated at once
which may cause:
- scheduling queue congestion
- cluster resource exhaustion

Recommendation: perform backfill in batches.
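Splitting a large backfill range into batches can be sketched as below: submit one small window of schedule times, wait for it to finish, then continue. The `batch_days` knob is an assumption for this sketch, not a DolphinScheduler parameter.

```python
from datetime import date, timedelta

def batches(start: date, end: date, batch_days: int = 7):
    """Yield (batch_start, batch_end) windows covering [start, end]."""
    day = start
    while day <= end:
        batch_end = min(day + timedelta(days=batch_days - 1), end)
        yield (day, batch_end)  # submit this window, wait for completion
        day = batch_end + timedelta(days=1)

for lo, hi in batches(date(2025, 1, 1), date(2025, 1, 20), batch_days=7):
    print(lo, "->", hi)
# 2025-01-01 -> 2025-01-07
# 2025-01-08 -> 2025-01-14
# 2025-01-15 -> 2025-01-20
```

Bounding each window keeps the number of in-flight instances small, so the scheduling queue and cluster resources are never flooded at once.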
Conclusion
In data platforms, scheduling systems are often regarded as simple “task triggers.” In reality, they are responsible for time management, dependency control, and failure recovery.
By understanding the true semantics of failure retry, manual rerun, and backfill, we can build stable and reliable data production systems.
Modern scheduling systems, such as Apache DolphinScheduler, already provide powerful mechanisms. However, the ultimate factor determining data quality is still:
Correct understanding of scheduling semantics + idempotent data task design.
Only in this way can data platforms remain recoverable, traceable, and reconstructable even when failures occur.
Previous articles:
Part 1 | Scheduling Systems Are More Than Just “Timers”
Part 2 | The Core Abstraction Model of Apache DolphinScheduler
Part 3 | How Scheduling Actually Runs
Part 4 | The State Machine: The Real Soul of Scheduling Systems

Next article preview:
Part 6 | Multi-Tenant and Resource Isolation Design in Apache DolphinScheduler
