How to Resolve Workflow Stuck Issues in Apache DolphinScheduler

#tooling #apachedolphinscheduler #opensource #bigdata

Background

In some cases, Apache DolphinScheduler may run into situations where a workflow appears to be stuck:

The workflow shows a “running” status (the gear icon keeps spinning), but when you open it, none of the task instances are actually running. The workflow stays in this state for an extremely long time—sometimes even days.
The workflow is shown as running, and the sub-workflows it triggers also appear to be running. But after opening a sub-workflow, you find that all task instances have already completed. The sub-workflow remains stuck on a node and never moves forward, preventing subsequent nodes from being triggered. It may stay in this state indefinitely.

Root Cause

From practical experience, issues like this nearly always point to one underlying cause:

DolphinScheduler timed out while interacting with the MySQL database.
If MySQL encounters deadlocks, long-running transactions, or slow queries, a state update can time out, and the scheduler’s internal state then diverges from what is actually stored in the database.
Once the two fall out of sync, workflow execution cannot continue.

Solution

When DolphinScheduler’s database operations time out, the scheduler may freeze at different points—sometimes before the SQL is executed, sometimes right after.
Since most of these operations are writes, DolphinScheduler cannot tell on its own whether a timed-out statement took effect, so there is little it can do to recover automatically.
The only approach is to restore database availability and then retry the workflow operations.

Here is the troubleshooting strategy we typically use:

Check the main DolphinScheduler MySQL tables (workflow definitions, task definitions, workflow instances, task instances) and verify whether simple update operations are timing out or being blocked by locks.
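
One way to run this check is a short probe that tries to lock a single row and gives up quickly instead of hanging. This is only a sketch: the table name `t_ds_process_instance` and the id `12345` are assumptions (table names differ between DolphinScheduler versions), so substitute the names from your own schema.

```sql
-- Fail fast instead of waiting the default 50 seconds for a row lock.
SET SESSION innodb_lock_wait_timeout = 5;

START TRANSACTION;
-- Assumed table name and id; pick a workflow instance row from your own schema.
SELECT id FROM t_ds_process_instance WHERE id = 12345 FOR UPDATE;
-- Nothing is modified; release the lock immediately.
ROLLBACK;
```

If the `SELECT` fails with error 1205 (lock wait timeout exceeded), another session is holding a lock on that row, which points to the "MySQL shows issues" branch of the strategy.
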
If MySQL appears healthy:

Verify whether the frontend UI status matches the DB status. If they differ, manually correct the database state to allow the workflow to proceed.
If the states do match, terminate the workflow, manually delete the workflow instance, and rerun the scheduling/backfill task.
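
The comparison between UI and database status can be sketched as two read-only queries against the metadata database. The table names (`t_ds_process_instance`, `t_ds_task_instance`), the column names, and the id `12345` are assumptions based on commonly seen DolphinScheduler schemas. Some releases rename them, so check your own schema first.

```sql
-- State of the workflow instance as recorded in the database
-- (assumed table and column names).
SELECT id, name, state, start_time, end_time
FROM t_ds_process_instance
WHERE id = 12345;

-- States of its task instances, to compare with what the UI shows.
SELECT id, name, state, start_time, end_time
FROM t_ds_task_instance
WHERE process_instance_id = 12345
ORDER BY start_time;
```

If `state` is stored as a numeric code in your version, look up its meaning for that version before changing anything, and back up the affected rows before any manual `UPDATE`.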

If MySQL shows issues:

Check for slow queries or slow updates, and confirm whether the MySQL server is running out of CPU, memory, or storage I/O.
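If statements cannot execute because they are blocked by locks, run SHOW PROCESSLIST to locate long-lived sessions, especially those in the Sleep state with a large Time value, which may still hold locks from an uncommitted transaction.
Kill them manually using: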
```
KILL <process_id>;
```
(The process ID is shown in the first column of the processlist.)
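
To narrow down which session to kill, it can help to list open InnoDB transactions together with the session that owns each one. The queries below are a sketch that assumes MySQL 5.7 or later with the `sys` schema installed. On recent MySQL 8.x versions, `performance_schema.processlist` may be the preferred source instead of `information_schema.processlist`.

```sql
-- Open InnoDB transactions, oldest first, with the owning session.
SELECT t.trx_mysql_thread_id AS process_id,
       t.trx_started,
       t.trx_state,
       p.user,
       p.host,
       p.command,
       p.time AS seconds_in_state,
       t.trx_query
FROM information_schema.innodb_trx AS t
JOIN information_schema.processlist AS p
  ON p.id = t.trx_mysql_thread_id
ORDER BY t.trx_started;

-- Which session is blocking which (requires the sys schema).
SELECT waiting_pid, waiting_query, blocking_pid, blocking_query, wait_age
FROM sys.innodb_lock_waits;
```

A row whose `command` is `Sleep` and whose `trx_started` is far in the past is a session holding a transaction open. Its `process_id` (or the `blocking_pid` from the second query) is the value to pass to `KILL`.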

The steps above resolve the majority of stuck-workflow cases.

If you have more questions, feel free to discuss them in the comments.