Introduction: As data platforms evolve, scheduling systems are no longer merely tools for “running tasks on a schedule.” They have become the central hub that carries complex dependencies and underpins system reliability. The column “Understanding Apache DolphinScheduler: From Scheduling Principles to DataOps Practice” breaks down the key design ideas behind scheduling through real engineering scenarios.
As the fourth article in the series, this piece focuses on the state machine mechanism in Apache DolphinScheduler, helping you understand how a scheduling system maintains order and reliability in an environment full of uncertainty.
Why Scheduling Systems Must Rely on State Machines
Among all the core mechanisms of a scheduling system, what truly determines reliability is never the UI, thread pools, or distributed frameworks—it is the state machine.
As long as a system needs to execute across nodes, tolerate failures, support retries, allow manual intervention, and automatically recover after exceptions, it must be designed around state transitions. Only by understanding this can we truly grasp the complexity of scheduling systems.
In Apache DolphinScheduler, TaskInstance and WorkflowInstance are not merely execution objects. They are actually two nested state machines.
The operation of the scheduling system is essentially not about executing tasks, but about pushing states forward.
Why Scheduling Systems Depend on State Machines
The biggest difference between a scheduling system and a regular program is that its execution process has a long lifecycle and inherent uncertainty.
A task may run for hours. During execution, it may encounter worker crashes, master failovers, network jitter, temporary database outages, or even manual termination and suspension.
If the system relied only on execution context stored in memory, once the process crashes, all information would be lost.
Therefore, scheduling systems must externalize the current progress into a persistent state. The status fields stored in the database become the real execution reference.
After a Master node restarts, it does not “remember” what was running before. Instead, it rescans the database and determines:
- which tasks need to be rescheduled,
- which have already finished,
- and which require fault recovery.
This is the essence of the state machine principle:
Execution logic should not rely on memory—it should rely on persistent state transitions.
TaskInstance State Transition Model
In DolphinScheduler, the design of TaskExecutionStatus is not simply about success or failure.
From creation to completion, a task typically goes through the following states:
public enum TaskExecutionStatus {
    SUBMITTED_SUCCESS,
    DISPATCH,
    RUNNING,
    SUCCESS,
    FAILURE,
    NEED_FAULT_TOLERANCE,
    KILL,
    KILL_SUCCESS,
    PAUSE,
    STOP,
    WAITING_THREAD,
    DELAY_EXECUTION
}
A basic successful path looks like this:
SUBMITTED_SUCCESS → DISPATCH → RUNNING → SUCCESS
However, this is only the ideal path. In distributed environments, paths involving fault tolerance are more common.
For example, when a Worker is lost or execution fails:
RUNNING → NEED_FAULT_TOLERANCE → SUBMITTED_SUCCESS → DISPATCH → RUNNING
This path means that after entering a fault-tolerance state, the task is resubmitted, rather than simply marked as failed.
The state itself defines what the system should do next.
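The idea that each state defines its own legal successors can be sketched as an explicit transition table. This is a simplified, hypothetical model covering only a subset of the states above; the names `TaskStateDemo` and `canTransition` are illustrative and not part of DolphinScheduler's actual API.

```java
import java.util.EnumSet;
import java.util.Map;

// Hypothetical sketch: a subset of TaskExecutionStatus with an explicit
// transition table. Note the fault-tolerance loop back to SUBMITTED_SUCCESS.
public class TaskStateDemo {

    enum State { SUBMITTED_SUCCESS, DISPATCH, RUNNING, SUCCESS, FAILURE, NEED_FAULT_TOLERANCE }

    static final Map<State, EnumSet<State>> TRANSITIONS = Map.of(
        State.SUBMITTED_SUCCESS,    EnumSet.of(State.DISPATCH),
        State.DISPATCH,             EnumSet.of(State.RUNNING, State.NEED_FAULT_TOLERANCE),
        State.RUNNING,              EnumSet.of(State.SUCCESS, State.FAILURE, State.NEED_FAULT_TOLERANCE),
        State.NEED_FAULT_TOLERANCE, EnumSet.of(State.SUBMITTED_SUCCESS), // resubmit, not fail
        State.SUCCESS,              EnumSet.noneOf(State.class),         // terminal
        State.FAILURE,              EnumSet.noneOf(State.class)          // terminal
    );

    /** Returns true only if `to` is a legal successor of `from`. */
    public static boolean canTransition(State from, State to) {
        return TRANSITIONS.getOrDefault(from, EnumSet.noneOf(State.class)).contains(to);
    }

    public static void main(String[] args) {
        // The fault-tolerance path: RUNNING -> NEED_FAULT_TOLERANCE -> SUBMITTED_SUCCESS
        System.out.println(canTransition(State.RUNNING, State.NEED_FAULT_TOLERANCE));
        System.out.println(canTransition(State.SUCCESS, State.RUNNING)); // terminal states go nowhere
    }
}
```

Encoding transitions in one table makes illegal jumps (such as reviving a terminal task) a checkable error rather than a silent bug.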
On the Master side, scheduling logic is essentially state-driven:
public void submitTask(TaskInstance taskInstance) {
    if (taskInstance.getState() == TaskExecutionStatus.SUBMITTED_SUCCESS) {
        dispatchToWorker(taskInstance);
        taskInstance.setState(TaskExecutionStatus.DISPATCH);
        updateTaskState(taskInstance);
    }
}
When the Worker executes a task, it also expresses progress through state changes:
public void executeTask(TaskInstance taskInstance) {
    taskInstance.setState(TaskExecutionStatus.RUNNING);
    updateTaskState(taskInstance);
    try {
        runTaskLogic(taskInstance);
        taskInstance.setState(TaskExecutionStatus.SUCCESS);
    } catch (Exception e) {
        taskInstance.setState(TaskExecutionStatus.FAILURE);
    }
    updateTaskState(taskInstance);
}
The key point is not the execution logic itself, but the rule that every state change must be persisted.
The state stored in the database is the single source of truth for the entire system.
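If the database is the single source of truth, a state write should be conditional on the state it expects to replace, so a stale writer cannot clobber a newer state. The sketch below is hypothetical: a `ConcurrentHashMap` stands in for the task instance table, and the compare-and-set `replace` call plays the role of a conditional SQL `UPDATE ... WHERE state = ?`.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of state persistence as a compare-and-set write.
// In a real deployment this would be a conditional UPDATE against the
// task instance table, not an in-memory map.
public class StateStoreDemo {

    static final ConcurrentHashMap<Long, String> taskStates = new ConcurrentHashMap<>();

    /** Persist a transition only if the stored state still matches `expected`. */
    public static boolean updateTaskState(long taskId, String expected, String next) {
        return taskStates.replace(taskId, expected, next);
    }

    public static void main(String[] args) {
        taskStates.put(1L, "RUNNING");
        System.out.println(updateTaskState(1L, "RUNNING", "SUCCESS")); // transition persisted
        System.out.println(updateTaskState(1L, "RUNNING", "FAILURE")); // stale writer rejected
    }
}
```

The second write fails because the task is no longer `RUNNING`: a delayed failure report cannot overwrite a success that already landed.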
WorkflowInstance: A Composite State Machine
If TaskInstance is an atomic state machine, then WorkflowInstance is a composite state machine.
The state of a workflow does not exist independently. Instead, it is essentially a function of all child task states.
During execution, the Master continuously scans the DAG to find tasks whose dependencies are satisfied and submits them for execution. At the same time, it updates the workflow state based on task results:
private void updateWorkflowStatus() {
    if (allTasksSuccess()) {
        workflowInstance.setState(SUCCESS);
    } else if (anyTaskFailureWithoutRetry()) {
        workflowInstance.setState(FAILURE);
    }
}
The core idea here is:
A Workflow does not actively “execute.” It simply advances its state based on changes in task states.
State transitions are the true event source driving the scheduling loop.
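"A function of all child task states" can be made literal: the workflow state is derivable from the multiset of task states alone, with no extra execution context. The sketch below is a simplified, hypothetical model (it ignores retries, pauses, and the other states) just to show that the derivation is a pure function.

```java
import java.util.List;

// Hypothetical sketch: workflow state derived purely from child task states.
// Retry handling and the richer state set are deliberately omitted.
public class WorkflowStateDemo {

    public static String deriveWorkflowState(List<String> taskStates) {
        if (taskStates.stream().allMatch("SUCCESS"::equals)) {
            return "SUCCESS";
        }
        if (taskStates.contains("FAILURE")) {
            return "FAILURE";
        }
        return "RUNNING"; // otherwise the workflow keeps advancing
    }

    public static void main(String[] args) {
        System.out.println(deriveWorkflowState(List.of("SUCCESS", "SUCCESS")));
        System.out.println(deriveWorkflowState(List.of("SUCCESS", "RUNNING")));
        System.out.println(deriveWorkflowState(List.of("FAILURE", "RUNNING")));
    }
}
```

Because the function is pure, the Master can recompute the workflow state from the database at any time, which is exactly what makes recovery after a restart possible.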
How the DolphinScheduler State Machine Works
From an architectural perspective, the state machine mechanism in DolphinScheduler can be abstracted into three collaborative layers:
- Database
- Master
- Worker

The database persists the state fields of all WorkflowInstance and TaskInstance.
The Worker executes tasks and reports status.
The Master acts as a state promoter, making the next scheduling decision based on state changes stored in the database.
When the Master starts, it does not rely on an in-memory snapshot. Instead, it performs recovery logic similar to the following:
public void recover() {
    List<WorkflowInstance> runningWorkflows = workflowDao.findRunning();
    for (WorkflowInstance wf : runningWorkflows) {
        rebuildWorkflowContext(wf);
        scheduleWorkflow(wf);
    }
}
For tasks that remain in the RUNNING state without heartbeats for a long time, the Master performs timeout detection:
if (taskInstance.getState() == RUNNING && timeout(taskInstance)) {
    taskInstance.setState(NEED_FAULT_TOLERANCE);
    updateTaskState(taskInstance);
}
After that, the state re-enters the schedulable phase.
In other words, the Master itself does not maintain complex execution logic. It simply keeps reading states, evaluating conditions, and advancing states.
This is a database-driven scheduling model.
The system’s robustness comes from reconstructable state, not from stable nodes.
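One pass of the "read states, evaluate, advance" loop can be sketched in a few lines. This is a hypothetical model, not the Master's actual code: a map stands in for the database, and only two transitions are advanced.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the database-driven scheduling loop: read every
// stored state, advance the ones that have a defined next step, leave the
// rest alone. A HashMap stands in for the database table.
public class SchedulerLoopDemo {

    public static void advanceOnce(Map<Long, String> taskStates) {
        for (Map.Entry<Long, String> e : taskStates.entrySet()) {
            switch (e.getValue()) {
                case "NEED_FAULT_TOLERANCE" -> e.setValue("SUBMITTED_SUCCESS"); // resubmit
                case "SUBMITTED_SUCCESS"    -> e.setValue("DISPATCH");          // hand to a worker
                default -> { /* terminal or in-flight: nothing to advance */ }
            }
        }
    }

    public static void main(String[] args) {
        Map<Long, String> db = new HashMap<>();
        db.put(1L, "NEED_FAULT_TOLERANCE");
        db.put(2L, "SUCCESS");
        advanceOnce(db);
        System.out.println(db.get(1L)); // re-enters the schedulable phase
        System.out.println(db.get(2L)); // terminal state untouched
    }
}
```

Because the loop reads everything it needs from the store, any Master process that runs it produces the same decisions; no in-memory snapshot is required.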
Why “Stuck Tasks” Are Usually Not Bugs
In production environments, “tasks getting stuck” is one of the most common complaints.
But from the perspective of a state machine, what appears to be “stuck” is often the system making a conservative decision.
For example, a task is in the RUNNING state but the Worker has lost connection. The system must choose between:
- immediately marking it as failed and retrying, or
- waiting longer to confirm whether the node will recover.
If it rolls back too early, duplicate execution may occur.
If it waits longer, it may appear to users as if the task is “stuck.”
This reflects a trade-off between reliability and real-time responsiveness.
State machine design typically prioritizes avoiding side effects, so it prefers delayed judgment rather than premature rollback.
Another common scenario is a failed state update. The task has completed successfully, but the database write fails, leaving the state as RUNNING.
In such cases, the system must rely on idempotent logic and recovery scanning mechanisms to ensure eventual consistency, rather than transient in-memory results.
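The idempotent logic this relies on can be sketched as a completion handler that is safe to replay. This is a hypothetical illustration (an in-memory map instead of the database): applying the same "task finished" event twice leaves the store unchanged, so a recovery scan can re-deliver results whose first write may or may not have landed.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an idempotent completion handler: only a RUNNING
// task can be finished, so replaying the same completion event is a no-op.
public class IdempotentFinishDemo {

    static final ConcurrentHashMap<Long, String> taskStates = new ConcurrentHashMap<>();

    public static void markSuccess(long taskId) {
        // Conditional write: replays after a lost acknowledgement are harmless.
        taskStates.replace(taskId, "RUNNING", "SUCCESS");
    }

    public static void main(String[] args) {
        taskStates.put(7L, "RUNNING");
        markSuccess(7L); // first delivery
        markSuccess(7L); // replayed after a failed state update: harmless
        System.out.println(taskStates.get(7L));
    }
}
```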
Therefore, many so-called “anomalies” are actually external manifestations of distributed consistency strategies.
How State Machines Ensure Reliability
State machine design gives scheduling systems four core capabilities:
- Idempotency
- Recoverability
- Eventual consistency
- Observability
Idempotency is achieved through state checks, ensuring that completed tasks are not executed again.
Recoverability is enabled through intermediate states such as NEED_FAULT_TOLERANCE, allowing tasks to re-enter the scheduling loop after failures.
Eventual consistency relies on the Master's state scanning mechanism after restart, allowing the system to converge to the correct state even after node failures.
Observability comes from clear state semantics, which makes troubleshooting possible.
The real challenge of a scheduling system is not how to submit tasks, but how to maintain correct state evolution under various failure scenarios.
If the state machine design contains flaws, it can lead to duplicate execution, missed execution, state disorder, or even deadlocks.
Conclusion
In Apache DolphinScheduler, reliability is not a feature of any single module. It is the result of the entire system being built around a state machine.
TaskInstance is the smallest state unit.
WorkflowInstance is the composite state machine.
The Master acts as the state promoter.
And the database holds the ultimate truth of the system state.
The soul of a scheduling system is never the executor—it is the flow of states.
The real difficulty is not making tasks run, but ensuring that states remain consistent, logic remains recoverable, and results eventually converge, no matter what failures occur.
Once we understand this, we realize:
Scheduling systems are difficult because reliability itself is an engineering art built on state machines.
Previous Articles
- Part 1 | Scheduling Systems Are More Than Just a “Timer”
- Part 2 | The Core Abstraction Model of Apache DolphinScheduler
- Part 3 | How Scheduling Actually “Runs”
Coming Up Next
Part 5 | Failures, Retries, and Backfills: The Truth Behind Scheduling Semantics