Imagine your important package is returned. The courier doesn’t discard it but redelivers it a few hours later according to your instructions—until it succeeds or reaches the maximum number of attempts. The task retry mechanism of Apache DolphinScheduler works just like this “smart redelivery” in a scheduling system.
DolphinScheduler provides a comprehensive task failure retry mechanism. When a business task node fails during execution, it can automatically retry after a delay until it succeeds or reaches the maximum retry limit. This mechanism is implemented through a state machine, event publishing, and a delayed queue to ensure execution reliability.
Core Mechanism: Three-Step Retry Strategy
1. Automatically Trigger Retry on Failure
When a task (such as a Shell script) fails, the system immediately checks two key configurations:
- `failRetryTimes`: Maximum number of retries (default is 0)
- `failRetryInterval`: Retry interval in minutes

These configurations are defined when creating the task. The frontend form converts them into JSON format for storage.
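As a rough illustration, the stored task definition might look like the following JSON excerpt. The `failRetryTimes` and `failRetryInterval` fields are the ones discussed above; the surrounding fields are a simplified sketch, and the real stored definition contains many more.

```json
{
  "name": "my-shell-task",
  "taskType": "SHELL",
  "failRetryTimes": 3,
  "failRetryInterval": 1
}
```

With this configuration, a failed run would be retried up to three times, waiting one minute between attempts.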
2. Intelligent Delay Calculation
The system does not retry immediately. Instead, it calculates the precise delay time:
```java
// Actual delay = configured interval - (current time - task end time)
long remainingTime = TimeUnit.MINUTES.toMillis(delayTime)
        + taskInstance.getEndTime().getTime()
        - System.currentTimeMillis();
```
This ensures that the effective gap between a task's failure and its retry matches the configured interval, regardless of when the failure event is actually processed.
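The calculation above can be sketched as a small self-contained method. Unlike the real code, this version takes the current time as a parameter (to make it testable) and floors the result at zero, so an overdue retry fires immediately; the method and class names are illustrative, not DolphinScheduler's own.

```java
public class RetryDelayCalculator {

    // Remaining delay = configured interval - time already elapsed since the task ended.
    // Equivalently: (task end time + interval) - now. Clamped at 0 for overdue retries.
    static long remainingDelayMillis(long retryIntervalMinutes,
                                     long taskEndTimeMillis,
                                     long nowMillis) {
        long remaining = retryIntervalMinutes * 60_000L + taskEndTimeMillis - nowMillis;
        return Math.max(remaining, 0L);
    }

    public static void main(String[] args) {
        long end = 1_000_000L;
        // 5-minute interval, 2 minutes already elapsed -> 3 minutes (180000 ms) remain
        System.out.println(remainingDelayMillis(5, end, end + 2 * 60_000L)); // 180000
        // 1-minute interval, 10 minutes already elapsed -> retry immediately
        System.out.println(remainingDelayMillis(1, end, end + 10 * 60_000L)); // 0
    }
}
```

The zero floor matters: if the scheduler was busy and the interval has already passed by the time the failure event is handled, the task should not wait for a full extra interval.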
3. State Machine–Driven Retry
The retry process is precisely controlled by a state machine:
- Failure State: After receiving a failure event, check whether retries are still allowed
- Retry State: Create a new task instance (while keeping the original `firstSubmitTime` unchanged)
- Re-execution: Publish a start event and place the task back into the scheduling queue
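The retry-state step can be sketched as cloning the failed instance: the attempt counter advances while `firstSubmitTime` is carried over unchanged. This is a minimal illustration, not DolphinScheduler's actual `TaskInstance` class; field and method names are assumptions.

```java
import java.time.Instant;

public class TaskInstance {
    final int retryTimes;            // how many retries have been attempted so far
    final Instant firstSubmitTime;   // when the very first attempt was submitted

    TaskInstance(int retryTimes, Instant firstSubmitTime) {
        this.retryTimes = retryTimes;
        this.firstSubmitTime = firstSubmitTime;
    }

    // New instance for the retry: the counter advances, firstSubmitTime is preserved
    TaskInstance cloneForRetry() {
        return new TaskInstance(retryTimes + 1, firstSubmitTime);
    }
}
```

Preserving `firstSubmitTime` is what lets all attempts of a task be traced back to the same logical submission, even though each retry is a fresh database record.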
Key Features: Which Tasks Support Retry?
✅ Retry-Supported Business Nodes
- Shell script tasks
- SQL query tasks
- Spark computation tasks
- All tasks that execute actual code
❌ Non-Retryable Logical Nodes
- Conditional branch nodes
- Sub-process nodes
- Dependency check nodes
These nodes only control workflow logic and do not execute real code.
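A scheduler therefore needs to gate retries on the task type. The sketch below shows one way to express that check; the type-name strings follow DolphinScheduler's usual naming, but the set and method here are illustrative assumptions, not the project's real API.

```java
import java.util.Set;

public class RetrySupportCheck {

    // Logical node types that only control workflow structure and never retry.
    // (Names are assumptions based on common DolphinScheduler task-type labels.)
    private static final Set<String> LOGICAL_TYPES =
            Set.of("CONDITIONS", "SUB_PROCESS", "DEPENDENT");

    static boolean supportsRetry(String taskType) {
        return !LOGICAL_TYPES.contains(taskType);
    }

    public static void main(String[] args) {
        System.out.println(supportsRetry("SHELL"));       // true
        System.out.println(supportsRetry("SUB_PROCESS")); // false
    }
}
```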
Practical Verification: Test Case
Integration testing demonstrates the full retry process:
- Task A fails on the first attempt (`retryTimes=0`)
- The system automatically retries once (`retryTimes=1`)
- Both task instances share the same `firstSubmitTime`
- After the final failure, the workflow stops
Special Scenario Handling
Dependency Waiting
If Task B depends on Task A, and Task A fails but has not yet reached the maximum retry limit, Task B will enter a waiting state instead of failing immediately.
Manual Intervention Capability
Even while a task is waiting for retry, you can:
- Pause the task (cancel subsequent retries)
- Terminate the task (force stop execution)
The system gracefully handles these interruption requests.
Fault Tolerance vs Retry: Key Differences
- Task Retry: Automatic retry after task code execution failure
- Worker Fault Tolerance: When a Worker server crashes, the Master takes over and reschedules the task (including Yarn jobs)
These are two distinct mechanisms.
Operational Recommendations
- Set reasonable retry intervals: Avoid overly frequent retries that increase system pressure
- Differentiate task types: Configure more retries for critical tasks
- Monitor retry metrics: Track the `retryTimes` field to identify unstable tasks
- Understand logical node limitations: Do not expect conditional branches to retry automatically
Summary
The retry mechanism of Apache DolphinScheduler is like a responsible dispatcher who never gives up easily when a task fails. Through precise state machine control, delay calculation, and event-driven architecture, it ensures reliable task execution in distributed environments.
Remember: only business nodes that execute actual code can benefit from this “redelivery” service. Logical nodes require manual intervention for exception handling.
Notes
- Retry intervals are measured in minutes; actual delay subtracts the time elapsed since task completion
- Each retry creates a new `TaskInstance` record while preserving the original `firstSubmitTime`
- Worker crash fault tolerance and task failure retry are two different mechanisms
- Logical nodes (such as sub-processes) do not support automatic retry and require manual handling
