Apache SeaTunnel

Posted on Jun 26

Debugging a Distributed Job Stuck in CANCELING in Apache SeaTunnel


Recently, I worked on an issue in Apache SeaTunnel where a job could sometimes stay in the CANCELING state forever after a user requested cancellation.

At first, I thought this would be a simple bug in the cancel logic. Maybe there was a deadlock, an infinite retry loop, or a missing state transition somewhere.

But after tracing the code more carefully, I realized the issue was more subtle. It was related to master-worker communication, master failover, job state recovery timing, and how one exception was handled during task status notification.

This post is a short troubleshooting note about what I found and why I added a separate Force Stop mechanism instead of directly rewriting the existing Cancel logic.

Background

SeaTunnel’s Zeta engine runs jobs across a cluster.

The master node manages job-level state, and worker nodes execute task groups. When a user cancels a job, the master needs to notify the worker that is running the task. After the task finishes or is canceled, the worker reports its final task state back to the master.

A simplified flow looks like this:

User requests cancel
        ↓
Master sends cancel request to worker
        ↓
Worker cancels or finishes the task
        ↓
Worker reports final task state to master
        ↓
Master updates the job state

This looks simple, but in a distributed system, each step can race with node failure, membership changes, or master recovery.

The Symptom

The symptom was that a job stayed in CANCELING indefinitely.

The user had already requested cancellation, but the job never moved to a final state such as CANCELED.

At first, I mainly focused on the master-to-worker cancel request path.

First Suspicion: The Cancel Request Path

On the master side, SeaTunnel sends a CancelTaskOperation to the worker node.

One important part of the logic is that it checks whether the current execution address still exists in the cluster membership before sending the cancel operation:

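// Keep trying while the task is still running and its worker is a known cluster member.
while (!taskFuture.isDone()
        && nodeEngine
                .getClusterService()
                .getMember(executionAddress = getCurrentExecutionAddress())
            != null) {
    try {
        nodeEngine
                .getOperationService()
                .createInvocationBuilder(
                        Constant.SEATUNNEL_SERVICE_NAME,
                        new CancelTaskOperation(taskGroupLocation),
                        executionAddress)
                .invoke()
                .get();
        return; // the cancel invocation completed
    } catch (Exception e) {
        // Invocation failed: wait, then re-check membership and retry.
        Thread.sleep(2000);
    }
}
// If getMember(...) returned null, we fall through here without sending any cancel request.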

This looked suspicious to me.

If the worker node temporarily disappears from the cluster view, for example because of a heartbeat issue, the loop may exit without sending the cancel request at all.

So my first thought was:

Maybe the master believes it has handled cancellation, but the worker never received the cancel request.

This path is still worth paying attention to. However, I later realized that it was not enough to fully explain why the job could remain in CANCELING forever.
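To make the silent exit concrete, here is a minimal sketch of the same loop shape. It is not SeaTunnel code: `CancelLoopSketch`, `cancelTask`, and the string addresses are hypothetical stand-ins, and the cluster membership view is modeled as a plain `Set`.

```java
import java.util.Set;

// Hypothetical stand-in for the master-side cancel loop; none of these names are SeaTunnel APIs.
class CancelLoopSketch {

    /** Returns how many cancel requests were actually sent (0 or 1 in this toy model). */
    static int cancelTask(Set<String> clusterMembers, String executionAddress, boolean taskDone) {
        int sent = 0;
        // Same guard as the real loop: task not done AND the worker is still a known member.
        while (!taskDone && clusterMembers.contains(executionAddress)) {
            sent++;      // stands in for invoking CancelTaskOperation on the worker
            return sent; // the real loop also returns after a successful invocation
        }
        // Reaching this point is silent: nothing was sent and no error is raised.
        return sent;
    }

    public static void main(String[] args) {
        // Worker visible in the membership view: one request is sent.
        System.out.println(cancelTask(Set.of("worker-1"), "worker-1", false)); // prints 1
        // Worker temporarily missing from the view: the loop body never runs.
        System.out.println(cancelTask(Set.of(), "worker-1", false)); // prints 0
    }
}
```

In the second call the caller gets a normal return with nothing sent, which is why this path can look as if cancellation was handled.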

Even if the cancel request is missed, the task may eventually finish and report its final state to the master.

So I started looking at the opposite direction:

What happens when the worker reports its final task state back to the master?

The More Important Path: Worker-to-Master Notification

When a task reaches a final state, the worker calls notifyTaskStatusToMaster.

The method is designed to retry until the notification succeeds:

// Retry until the master has accepted (or definitively rejected) the notification.
while (isRunning && !notifyStateSuccess) {
    InvocationFuture<Object> invoke =
            nodeEngine
                    .getOperationService()
                    .createInvocationBuilder(
                            SeaTunnelServer.SERVICE_NAME,
                            new NotifyTaskStatusOperation(
                                    taskGroupLocation, taskExecutionState),
                            nodeEngine.getMasterAddress())
                    .invoke();
    try {
        invoke.get();
        notifyStateSuccess = true;
    } catch (JobNotFoundException e) {
        // The master does not know the job: treated as terminal, so retrying stops.
        logger.warning("send notify task status failed because can't find job", e);
        notifyStateSuccess = true;
    } catch (ExecutionException e) {
        if (e.getCause() instanceof JobNotFoundException) {
            // Same terminal handling when the exception arrives wrapped.
            logger.warning("send notify task status failed because can't find job", e);
            notifyStateSuccess = true;
        } else {
            // Any other failure: back off and retry.
            Thread.sleep(sleepTime);
        }
    }
}

The retry logic itself looked reasonable at first.

But the way it handles JobNotFoundException turned out to matter.

If the worker receives JobNotFoundException, it sets notifyStateSuccess = true and stops retrying.

In many cases, this behavior makes sense. If the job no longer exists on the master, it may mean the job has already finished and was removed from the running job map.

But during master failover, this assumption can become dangerous.

Where JobNotFoundException Comes From

On the master side, the task state update checks runningJobMasterMap:

public void updateTaskExecutionState(TaskExecutionState taskExecutionState) {
    TaskGroupLocation taskGroupLocation = taskExecutionState.getTaskGroupLocation();
    JobMaster runningJobMaster = runningJobMasterMap.get(taskGroupLocation.getJobId());
    if (runningJobMaster == null) {
        throw new JobNotFoundException(
                String.format("Job %s not running", taskGroupLocation.getJobId()));
    }
    runningJobMaster.updateTaskExecutionState(taskExecutionState);
}

Normally, runningJobMaster == null means the job is not running anymore.

However, there is another possible timing window.

During master failover, the new master may not have fully restored runningJobMasterMap yet. If a worker sends its final task status during that window, the new master can fail to find the corresponding JobMaster and throw JobNotFoundException.

Then the worker treats this exception as success and stops retrying.

The problematic sequence can look like this:

1. A job is being canceled.
2. A master failover happens.
3. A worker finishes its task and sends the final task state.
4. The new master has not fully restored runningJobMasterMap yet.
5. The new master throws JobNotFoundException.
6. The worker treats this as success and stops retrying.
7. The master never receives the final task state after recovery.
8. The job remains in CANCELING.

This was the key point for me.

The issue was not just that the cancel RPC might fail. The more important problem was that the worker’s final status notification could be lost during master recovery.
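The race can be reproduced in a toy model. The sketch below is only an illustration under simplifying assumptions: `FailoverRaceSketch`, `Master`, `restoreJob`, and `notifyFinalState` are hypothetical names, job state is a plain string, and the "network" is a direct method call. Only the exception handling mirrors the worker loop described in this post.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the failover race. All names are illustrative, not SeaTunnel's real classes.
class FailoverRaceSketch {

    static class JobNotFoundException extends RuntimeException {
        JobNotFoundException(String message) { super(message); }
    }

    static class Master {
        // Plays the role of runningJobMasterMap: jobId -> current job state.
        private final Map<Long, String> runningJobs = new ConcurrentHashMap<>();

        // Simulates recovery restoring a job that was being canceled before failover.
        void restoreJob(long jobId) { runningJobs.put(jobId, "CANCELING"); }

        void updateTaskExecutionState(long jobId) {
            if (!runningJobs.containsKey(jobId)) {
                throw new JobNotFoundException("Job " + jobId + " not running");
            }
            runningJobs.put(jobId, "CANCELED");
        }

        String stateOf(long jobId) { return runningJobs.get(jobId); }
    }

    /** Worker side: returns the number of attempts made before it stopped. */
    static int notifyFinalState(Master master, long jobId) {
        int attempts = 0;
        boolean notifyStateSuccess = false;
        while (!notifyStateSuccess) {
            attempts++;
            try {
                master.updateTaskExecutionState(jobId);
                notifyStateSuccess = true;
            } catch (JobNotFoundException e) {
                // Interpreted as "job already gone", so the worker stops retrying.
                notifyStateSuccess = true;
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        Master newMaster = new Master();                 // after failover, nothing restored yet
        int attempts = notifyFinalState(newMaster, 1L);  // worker reports inside the window
        newMaster.restoreJob(1L);                        // recovery completes afterwards
        System.out.println(attempts + " " + newMaster.stateOf(1L)); // prints "1 CANCELING"
    }
}
```

The worker gives up after a single attempt, and the restored job has no remaining event that could move it out of CANCELING.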

Why I Added Force Stop Instead of Rewriting Cancel

After finding this path, I considered whether the existing Cancel logic should be changed directly.

There were several possible directions:

change how JobNotFoundException is handled,
wait until runningJobMasterMap recovery is completed,
store more running job state in distributed storage,
or redesign the cancellation state transition.

But the issue was intermittent and timing-dependent, which makes any fix hard to verify. Also, the normal Cancel path is a sensitive part of the execution lifecycle. A direct change there could alter behavior that existing jobs rely on or add performance overhead.
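For illustration only, the first two directions could look roughly like the sketch below. This is not what I implemented and not SeaTunnel's API: `MasterRecoveringException`, the `recovered` flag, `finishRecovery`, and `notifyWithRetry` are all hypothetical. The idea is that the master answers "still recovering" separately from "job not found", so the worker retries only in the first case.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical alternative, not the approach taken in this post.
class RecoveryAwareNotifySketch {

    static class JobNotFoundException extends RuntimeException {
        JobNotFoundException(String message) { super(message); }
    }

    // Assumed new signal: the master exists but has not finished restoring jobs.
    static class MasterRecoveringException extends RuntimeException {
        MasterRecoveringException(String message) { super(message); }
    }

    static class Master {
        private final Map<Long, String> jobState = new ConcurrentHashMap<>();
        private volatile boolean recovered = false;

        void finishRecovery(long jobId) {
            jobState.put(jobId, "CANCELING");
            recovered = true;
        }

        void updateTaskExecutionState(long jobId) {
            if (!recovered) {
                throw new MasterRecoveringException("master is still restoring jobs");
            }
            if (!jobState.containsKey(jobId)) {
                throw new JobNotFoundException("Job " + jobId + " not running");
            }
            jobState.put(jobId, "CANCELED");
        }

        String stateOf(long jobId) { return jobState.get(jobId); }
    }

    /** Returns the attempt on which the worker stopped, or -1 if maxAttempts was exhausted. */
    static int notifyWithRetry(Master master, long jobId, int maxAttempts, Runnable betweenAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                master.updateTaskExecutionState(jobId);
                return attempt;              // delivered
            } catch (JobNotFoundException e) {
                return attempt;              // terminal: the job really is gone
            } catch (MasterRecoveringException e) {
                betweenAttempts.run();       // stands in for a backoff sleep
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Master master = new Master();
        // Recovery finishes while the worker is backing off after its first attempt.
        int attempts = notifyWithRetry(master, 1L, 5, () -> master.finishRecovery(1L));
        System.out.println(attempts + " " + master.stateOf(1L)); // prints "2 CANCELED"
    }
}
```

Even in this small form, the change touches both the master's response contract and the worker's retry loop, which is part of why I did not want to start there.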

So I chose a more practical approach first.

Instead of changing the meaning of normal Cancel, I added a separate Force Stop mechanism.

The idea was simple:

If graceful cancellation cannot make progress, operators need an explicit way to finalize the job state.

Cancel vs Force Stop

I tried to keep the difference between Cancel and Force Stop clear.

Cancel

Cancel is for graceful termination.

It asks the running task to stop and depends on the normal task lifecycle and worker notification path.

Force Stop

Force Stop is for operational recovery.

It should not depend on whether the remote worker can still respond correctly. The master finalizes the job state and cleans up based on that decision.

In short:

Cancel     = try to stop the job gracefully
Force Stop = finalize the job when Cancel cannot make progress

Force Stop is not meant to replace Cancel. It is a fallback path for stuck cases.
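The difference can be sketched as a tiny state machine. This is a simplified model, not the actual implementation: `ForceStopSketch`, `Job`, and `onWorkerFinalState` are hypothetical names, and I am assuming here that Force Stop ends in CANCELED, which the real state model may represent differently.

```java
// Simplified model of Cancel vs Force Stop; names and states are illustrative assumptions.
class ForceStopSketch {

    enum JobStatus { RUNNING, CANCELING, CANCELED }

    static class Job {
        private JobStatus status = JobStatus.RUNNING;

        /** Graceful path: only requests cancellation, then waits for the worker. */
        void cancel() {
            if (status == JobStatus.RUNNING) {
                status = JobStatus.CANCELING;
            }
        }

        /** Called when the worker's final task state reaches the master. */
        void onWorkerFinalState() {
            if (status == JobStatus.CANCELING) {
                status = JobStatus.CANCELED;
            }
        }

        /** Recovery path: the master finalizes the state without waiting for any worker. */
        void forceStop() {
            status = JobStatus.CANCELED;
            // Resource cleanup on the master would happen here.
        }

        JobStatus status() { return status; }
    }

    public static void main(String[] args) {
        Job job = new Job();
        job.cancel();
        // The worker notification is lost, so onWorkerFinalState() is never called.
        System.out.println(job.status()); // prints CANCELING
        job.forceStop();
        System.out.println(job.status()); // prints CANCELED
    }
}
```

Cancel needs a second event from the worker to complete, while Force Stop is a single master-side decision. That is what makes it usable when the notification path is the broken part.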

What I Learned

This issue taught me that a stuck job state is not always caused by the code that directly updates that state.

In this case, the cancel request path looked suspicious at first. But the more important problem was on the worker-to-master notification path.

JobNotFoundException looked like a reasonable terminal condition in normal cases, but during master failover it could also mean:

The new master has not recovered the job yet.

Those two meanings are very different.

That small difference can decide whether the worker should stop retrying or keep trying.

For me, this was a good reminder that in distributed systems, exception handling is part of the state machine. It is not just error handling.

Conclusion

The stuck CANCELING issue was not simply a failed cancel request.

A worker could finish its task and try to notify the master, but if this happened during master failover before the new master fully restored its running job state, the master could throw JobNotFoundException. Because the worker treated this exception as a successful terminal condition, it stopped retrying. As a result, the master could miss the final task state and keep the job in CANCELING.

Force Stop was added as a practical recovery mechanism for this kind of situation.

It does not replace normal Cancel. Instead, it gives operators a way to finalize a job when graceful cancellation cannot make progress.

The biggest lesson for me was simple:

In distributed systems, the hard part is not only sending a request. The hard part is deciding what the system should believe when the request races with failure, recovery, or a delayed state.

DEV Community