Imagine your important package is returned. The courier doesn’t discard it but redelivers it a few hours later according to your instructions—until it succeeds or reaches the maximum number of attempts. The task retry mechanism of Apache DolphinScheduler works just like this “smart redelivery” in a scheduling system.
DolphinScheduler provides a comprehensive task failure retry mechanism. When a business task node fails during execution, it can automatically retry after a delay until it succeeds or reaches the maximum retry limit. This mechanism is implemented through a state machine, event publishing, and a delayed queue to ensure execution reliability.
Core Mechanism: Three-Step Retry Strategy
1. Automatically Trigger Retry on Failure
When a task (such as a Shell script) fails, the system immediately checks two key configurations:
- `failRetryTimes`: Maximum number of retries (default is 0)
- `failRetryInterval`: Retry interval in minutes

These configurations are defined when creating the task. The frontend form converts them into JSON format for storage.
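As a rough illustration, the stored task definition might look like the following JSON excerpt. The `failRetryTimes` and `failRetryInterval` fields are the ones discussed above; the surrounding fields are a simplified sketch, and the real stored definition contains many more.

```json
{
  "name": "my-shell-task",
  "taskType": "SHELL",
  "failRetryTimes": 3,
  "failRetryInterval": 1
}
```

With this configuration, a failed run would be retried up to three times, waiting one minute between attempts.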
2. Intelligent Delay Calculation
The system does not retry immediately. Instead, it calculates the precise delay time:
```java
// Actual delay = configured interval - (current time - task end time)
long remainingTime = TimeUnit.MINUTES.toMillis(delayTime)
        + taskInstance.getEndTime().getTime()
        - System.currentTimeMillis();
```
This ensures that the effective gap between a task's failure and its retry matches the configured interval, regardless of when the failure event is actually processed.
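The calculation above can be sketched as a small self-contained method. Unlike the real code, this version takes the current time as a parameter (to make it testable) and floors the result at zero, so an overdue retry fires immediately; the method and class names are illustrative, not DolphinScheduler's own.

```java
public class RetryDelayCalculator {

    // Remaining delay = configured interval - time already elapsed since the task ended.
    // Equivalently: (task end time + interval) - now. Clamped at 0 for overdue retries.
    static long remainingDelayMillis(long retryIntervalMinutes,
                                     long taskEndTimeMillis,
                                     long nowMillis) {
        long remaining = retryIntervalMinutes * 60_000L + taskEndTimeMillis - nowMillis;
        return Math.max(remaining, 0L);
    }

    public static void main(String[] args) {
        long end = 1_000_000L;
        // 5-minute interval, 2 minutes already elapsed -> 3 minutes (180000 ms) remain
        System.out.println(remainingDelayMillis(5, end, end + 2 * 60_000L)); // 180000
        // 1-minute interval, 10 minutes already elapsed -> retry immediately
        System.out.println(remainingDelayMillis(1, end, end + 10 * 60_000L)); // 0
    }
}
```

The zero floor matters: if the scheduler was busy and the interval has already passed by the time the failure event is handled, the task should not wait for a full extra interval.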
3. State Machine–Driven Retry
The retry process is precisely controlled by a state machine:
- Failure State: After receiving a failure event, check whether retries are still allowed
- Retry State: Create a new task instance (while keeping the original `firstSubmitTime` unchanged)
- Re-execution: Publish a start event and place the task back into the scheduling queue
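The retry-state step can be sketched as cloning the failed instance: the attempt counter advances while `firstSubmitTime` is carried over unchanged. This is a minimal illustration, not DolphinScheduler's actual `TaskInstance` class; field and method names are assumptions.

```java
import java.time.Instant;

public class TaskInstance {
    final int retryTimes;            // how many retries have been attempted so far
    final Instant firstSubmitTime;   // when the very first attempt was submitted

    TaskInstance(int retryTimes, Instant firstSubmitTime) {
        this.retryTimes = retryTimes;
        this.firstSubmitTime = firstSubmitTime;
    }

    // New instance for the retry: the counter advances, firstSubmitTime is preserved
    TaskInstance cloneForRetry() {
        return new TaskInstance(retryTimes + 1, firstSubmitTime);
    }
}
```

Preserving `firstSubmitTime` is what lets all attempts of a task be traced back to the same logical submission, even though each retry is a fresh database record.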
Key Features: Which Tasks Support Retry?
✅ Retry-Supported Business Nodes
- Shell script tasks
- SQL query tasks
- Spark computation tasks
- All tasks that execute actual code
❌ Non-Retryable Logical Nodes
- Conditional branch nodes
- Sub-process nodes
- Dependency check nodes
These nodes only control workflow logic and do not execute real code.
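A scheduler therefore needs to gate retries on the task type. The sketch below shows one way to express that check; the type-name strings follow DolphinScheduler's usual naming, but the set and method here are illustrative assumptions, not the project's real API.

```java
import java.util.Set;

public class RetrySupportCheck {

    // Logical node types that only control workflow structure and never retry.
    // (Names are assumptions based on common DolphinScheduler task-type labels.)
    private static final Set<String> LOGICAL_TYPES =
            Set.of("CONDITIONS", "SUB_PROCESS", "DEPENDENT");

    static boolean supportsRetry(String taskType) {
        return !LOGICAL_TYPES.contains(taskType);
    }

    public static void main(String[] args) {
        System.out.println(supportsRetry("SHELL"));       // true
        System.out.println(supportsRetry("SUB_PROCESS")); // false
    }
}
```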
Practical Verification: Test Case
Integration testing demonstrates the full retry process:
- Task A fails on the first attempt (`retryTimes=0`)
- The system automatically retries once (`retryTimes=1`)
- Both task instances share the same `firstSubmitTime`
- After the final failure, the workflow stops
Special Scenario Handling
Dependency Waiting
If Task B depends on Task A, and Task A fails but has not yet reached the maximum retry limit, Task B will enter a waiting state instead of failing immediately.
Manual Intervention Capability
Even while a task is waiting for retry, you can:
- Pause the task (cancel subsequent retries)
- Terminate the task (force stop execution)
The system gracefully handles these interruption requests.
Fault Tolerance vs Retry: Key Differences
- Task Retry: Automatic retry after task code execution failure
- Worker Fault Tolerance: When a Worker server crashes, the Master takes over and reschedules the task (including Yarn jobs)
These are two distinct mechanisms.
Operational Recommendations
- Set reasonable retry intervals: Avoid overly frequent retries that increase system pressure
- Differentiate task types: Configure more retries for critical tasks
- Monitor retry metrics: Track the `retryTimes` field to identify unstable tasks
- Understand logical node limitations: Do not expect conditional branches to retry automatically
Summary
The retry mechanism of Apache DolphinScheduler is like a responsible dispatcher who never gives up easily when a task fails. Through precise state machine control, delay calculation, and event-driven architecture, it ensures reliable task execution in distributed environments.
Remember: only business nodes that execute actual code can benefit from this “redelivery” service. Logical nodes require manual intervention for exception handling.
Notes
- Retry intervals are measured in minutes; actual delay subtracts the time elapsed since task completion
- Each retry creates a new `TaskInstance` record while preserving the original `firstSubmitTime`
- Worker crash fault tolerance and task failure retry are two different mechanisms
- Logical nodes (such as sub-processes) do not support automatic retry and require manual handling
