Chen Debra
How to Solve the Critical Issue Where DolphinScheduler's Self-Fault Tolerance Causes Continuous Server Crashes


Problem Reproduction

In DolphinScheduler, the following Shell task exists:

current_timestamp() {  
    date +"%Y-%m-%d %H:%M:%S"  
}

TIMESTAMP=$(current_timestamp)
echo $TIMESTAMP
sleep 60

The workflow execution strategy in DolphinScheduler is set to parallel.

The scheduling interval is set to 10 seconds.

After deploying the scheduled task, it runs as expected.

At this point, kill the Master node to simulate a crash:

$ jps
1979710 AlertServer
1979626 WorkerServer
1979546 MasterServer
1979794 ApiApplicationServer
1980483 Jps
$ kill -9 1979546

Upon checking DolphinScheduler, we notice that the Master node is no longer available.

We then observe that the workflow stops scheduling new runs, while the in-flight tasks remain in the running state indefinitely until an error occurs.

After a while (simulating the discovery of the crash issue), restart DolphinScheduler:

sh bin/stop-all.sh
sh bin/start-all.sh

Once restarted, all of the previously interrupted tasks, plus every run that was missed while the service was down, are executed at once.

This creates a critical issue: if these are resource-intensive tasks, the burst can exhaust CPU and memory and bring down the entire server!

[Image: CPU usage and RAM usage]

Multi-Scenario Testing

  • Master crash, restart the entire DS: This causes the above issue.
  • Master crash, restart only the corresponding Master: This causes the above issue. There is also a defect here: the official scripts offer no background start script for the Master alone, only a foreground one, though rerunning start-all.sh works.
  • Worker crash, restart the entire DS: This does not cause the above issue — because the Master will continue scheduling tasks, and the tasks will fail directly after the Worker crashes.
  • Worker crash, restart the corresponding Worker: This does not cause the above issue — it's similar to the Worker crash scenario.
  • Entire DS crash, restart the entire DS: This causes the above issue.
  • DS stopped using stop-all.sh, then restart: This causes the above issue.

The core issue lies with the Master. As long as periodic tasks are configured, whether the Master crashes or is stopped via a script, the above issue will occur.

Principle Analysis

Core components of DolphinScheduler:

  • MasterServer: Responsible for DAG task segmentation, task submission monitoring, and monitoring the health status of other Master and Worker servers. When started, it registers temporary nodes in ZooKeeper and handles fault tolerance through ZooKeeper node changes.
  • WorkerServer: Responsible for task execution and log service. It registers a temporary node in ZooKeeper and maintains a heartbeat.
  • ApiServer: Handles requests from the front-end UI.

The task execution flow can be summarized as follows:

  1. Task creation: A task is created in the API-Server and persisted into the database.
  2. Command generation: A user manually triggers or schedules a task, which writes a command to the database to execute the workflow.
  3. Master consumption of commands: The Master consumes commands from the database, starts the workflow, and assigns tasks to the Worker for execution.
  4. Completion: After the workflow is completed, the Master finishes the execution of the workflow.

[Image: DolphinScheduler architecture and task execution flow]

Based on the official website, the core task execution process of DolphinScheduler can be detailed as follows:

[Image: detailed core task execution process, from the official documentation]

Given the complexity of task scheduling, a large process can be divided into smaller processes, with auxiliary processes added outside the main flow. Below is an analysis of the breakdown of the execution scheduling process, which makes it easier to understand:

[Image: breakdown of the execution scheduling process]

i. First, the producer API server encapsulates the user's workflow execution HTTP request into command data and inserts it into the t_ds_command table. Below is an example of a command for starting a workflow instance (old version):


{
    "commandType": "START_PROCESS",
    "processDefinitionCode": 14285512555584,
    "executorId": 1,
    "commandParam": "{}",
    "taskDependType": "TASK_POST",
    "failureStrategy": "CONTINUE",
    "warningType": "NONE",
    "startTime": 1723444881372,
    "processInstancePriority": "MEDIUM",
    "updateTime": 1723444881372,
    "workerGroup": "default",
    "tenantCode": "default",
    "environmentCode": -1,
    "dryRun": 0,
    "processInstanceId": 0,
    "processDefinitionVersion": 1,
    "testFlag": 0
}

ii. Next, on the consumer side, the MasterSchedulerBootstrap loop in the Master server uses ZooKeeper to allocate a slot to itself, then selects from the t_ds_command table the commands belonging to that slot. The query statement is as follows:


<select id="queryCommandPageBySlot" resultType="org.apache.dolphinscheduler.dao.entity.Command">
    select *
    from t_ds_command
    where id % #{masterCount} = #{thisMasterSlot}
    order by process_instance_priority, id asc
    limit #{limit}
</select>
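
To make the slot idea concrete, here is a minimal standalone sketch (hypothetical values, not DolphinScheduler's actual classes): with N registered Masters, each Master only consumes the commands whose id maps to its own slot, so no command is picked up twice.

public class SlotDemo {
    public static void main(String[] args) {
        int masterCount = 3;       // number of Masters registered in ZooKeeper
        int thisMasterSlot = 1;    // slot assigned to this Master
        long[] commandIds = {1988, 1989, 1990, 1991, 1992, 1993};
        for (long id : commandIds) {
            // mirrors "where id % #{masterCount} = #{thisMasterSlot}"
            if (id % masterCount == thisMasterSlot) {
                System.out.println("this master consumes command " + id);
            }
        }
    }
}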

iii. The MasterSchedulerBootstrap loop polls for commands to process. For each command, it builds a ProcessInstance bound to the master host and inserts it into the t_ds_process_instance table. It then creates an executable WorkflowExecuteRunnable holding the necessary runtime context, caches it locally in processInstanceExecCacheManager, and pushes a WorkflowEventType.START_WORKFLOW event for the ProcessInstance into the workflowEventQueue.

The steps above describe what happens after a user clicks the start button on the web page. The issue at hand, however, concerns the Master's periodic scheduling. According to the documentation, a periodic schedule is likewise encapsulated by the MasterServer as command data and inserted into the t_ds_command table, after which the steps above follow. The general process is as follows:

  1. Command Dispatching: Triggered by a user's submitted workflow request, the MasterServer encapsulates it into command data and inserts it into the database.

  2. Task Assignment: The MasterServer continuously queries for pending commands and assigns tasks to the corresponding ProcessInstance based on the load.

  3. Task Execution: Based on the DAG (Directed Acyclic Graph) dependencies, the WorkerServer will prioritize tasks that have no dependencies and gradually execute other tasks according to their priority.

  4. Status Feedback: During task execution, the WorkerServer periodically calls back to the MasterServer to notify the progress and execution status of the tasks.

The issue arises when the Master restarts: an excessive number of commands, particularly for periodic tasks, piles up in the t_ds_command table.

In DolphinScheduler 3.2.1, the data sample for the t_ds_command table is as follows:


id  |command_type|process_definition_code|process_definition_version|process_instance_id|command_param                        |task_depend_type|failure_strategy|warning_type|warning_group_id|schedule_time      |start_time         |executor_id|update_time        |process_instance_priority|worker_group|tenant_code|environment_code|dry_run|test_flag|
----+------------+-----------------------+--------------------------+-------------------+-------------------------------------+----------------+----------------+------------+----------------+-------------------+-------------------+-----------+-------------------+-------------------------+------------+-----------+----------------+-------+---------+
1988|           6|         15921642898976|                         4|                  0|{"schedule_timezone":"Asia/Shanghai"}|               2|               1|           0|               0|2024-12-11 00:36:40|2024-12-11 00:39:01|          2|2024-12-11 00:39:01|                        2|default     |default    |              -1|      0|        0|
1989|           6|         15921642898976|                         4|                  0|{"schedule_timezone":"Asia/Shanghai"}|               2|               1|           0|               0|2024-12-11 00:36:50|2024-12-11 00:39:01|          2|2024-12-11 00:39:01|                        2|default     |default    |              -1|      0|        0|
1990|           6|         15921642898976|                         4|                  0|{"schedule_timezone":"Asia/Shanghai"}|               2|               1|           0|               0|2024-12-11 00:37:00|2024-12-11 00:39:01|          2|2024-12-11 00:39:01|                        2|default     |default    |              -1|      0|        0|

The command_type enum is defined in the source code under CommandType, and its content is as follows:


/**
 * command types
 * 0 start a new process
 * 1 start a new process from current nodes
 * 2 recover tolerance fault process
 * 3 recover suspended process
 * 4 start process from failure task nodes
 * 5 complement data
 * 6 start a new process from scheduler
 * 7 repeat running a process
 * 8 pause a process
 * 9 stop a process
 * 10 recover waiting thread
 * 11 recover serial wait
 * 12 start a task node in a process instance
 */
START_PROCESS(0, "start a new process"),
START_CURRENT_TASK_PROCESS(1, "start a new process from current nodes"),
RECOVER_TOLERANCE_FAULT_PROCESS(2, "recover tolerance fault process"),
RECOVER_SUSPENDED_PROCESS(3, "recover suspended process"),
START_FAILURE_TASK_PROCESS(4, "start process from failure task nodes"),
COMPLEMENT_DATA(5, "complement data"),
SCHEDULER(6, "start a new process from scheduler"),
REPEAT_RUNNING(7, "repeat running a process"),
PAUSE(8, "pause a process"),
STOP(9, "stop a process"),
RECOVER_WAITING_THREAD(10, "recover waiting thread"),
RECOVER_SERIAL_WAIT(11, "recover serial wait"),
EXECUTE_TASK(12, "start a task node in a process instance"),
DYNAMIC_GENERATION(13, "dynamic generation"),
;

The reason for this behavior lies in the fault tolerance mechanism of the Master itself. This fault tolerance mechanism can be broken down into several modules as follows:

  1. Master's fault tolerance: In a multi-Master setup, one node is the Active Master responsible for handling task scheduling requests, while the others serve as Standby Masters. If the Active Master fails, a Standby Master automatically takes over its duties to ensure the system operates normally. This process is managed through ZooKeeper, which is responsible for electing the Active Master node and monitoring the node's status.

  2. State synchronization: Multiple Master nodes synchronize their states to ensure that in the event of an Active Master failure, a Standby Master can take over the task scheduling seamlessly.

  3. Fault recovery: When a Master node fails, other Master nodes use ZooKeeper's Watcher mechanism to detect the failure and trigger fault recovery processes.

  4. Fault tolerance for running tasks: After a Master node fails, a new Master queries for the ProcessInstances that need fault tolerance, using the failed Master's host address and the set of in-progress workflow states. Corresponding commands are then inserted into the t_ds_command table, and the regular scheduling process (the Master fetches and schedules them, then Workers execute) resumes.

  5. Distributed locks: During the fault tolerance process, the Master nodes use ZooKeeper’s distributed locking mechanism, combined with the assignment of specific IDs to the command table, to ensure that only one Master node performs the fault tolerance operation at a time. This prevents multiple Master nodes from simultaneously taking over the same task.

  6. Periodic fault tolerance thread: Apart from the fault tolerance triggered by ZooKeeper events, DolphinScheduler also implements a periodic thread called FailoverExecuteThread. This thread is responsible for restoring workflow instances after the Master has been restarted.

  7. Task retry: DolphinScheduler also supports a task retry mechanism after a task failure. This complements the service downtime fault tolerance, ensuring that tasks are eventually executed successfully.

So, based on the principle and reproduction, it can be initially inferred that the fault tolerance is carried out by a specific thread during the startup of the Master. Next, we should examine the source code to verify this.

Source Code Analysis

In the org.apache.dolphinscheduler.server.master.MasterServer class, during Master startup, there’s an entry point for the run() method:

/**
 * run master server
 */
@PostConstruct
public void run() throws SchedulerException {
    // init rpc server
    this.masterRPCServer.start();
    // install task plugin
    this.taskPluginManager.loadPlugin();
    this.masterSlotManager.start();
    // self tolerant
    this.masterRegistryClient.start();
    this.masterRegistryClient.setRegistryStoppable(this);
    this.masterSchedulerBootstrap.start();
    this.eventExecuteService.start();
    this.failoverExecuteThread.start();
    this.schedulerApi.start();
    this.taskGroupCoordinator.start();
    MasterServerMetrics.registerMasterCpuUsageGauge(() -> {
        SystemMetrics systemMetrics = metricsProvider.getSystemMetrics();
        return systemMetrics.getTotalCpuUsedPercentage();
    });
    MasterServerMetrics.registerMasterMemoryAvailableGauge(() -> {
        SystemMetrics systemMetrics = metricsProvider.getSystemMetrics();
        return (systemMetrics.getSystemMemoryMax() - systemMetrics.getSystemMemoryUsed()) / 1024.0 / 1024 / 1024;
    });
    MasterServerMetrics.registerMasterMemoryUsageGauge(() -> {
        SystemMetrics systemMetrics = metricsProvider.getSystemMetrics();
        return systemMetrics.getJvmMemoryUsedPercentage();
    });
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
        if (!ServerLifeCycleManager.isStopped()) {
            close("MasterServer shutdownHook");
        }
    }));
}

From the above code, it can be seen that the Master starts by executing:

  • masterRPCServer.start(): Initializes and starts the RPC server for node communication.
  • taskPluginManager.loadPlugin(): Loads task plugins, which extend the task types of DolphinScheduler.
  • masterSlotManager.start(): Starts the Master slot manager, which is responsible for managing the resource slots of the Master for task scheduling.
  • masterRegistryClient.start(): Starts the Master registry client, which is responsible for registering the Master node to the distributed coordination service (such as ZooKeeper).
  • masterRegistryClient.setRegistryStoppable(this): Sets the stoppable object for the registry client, allowing it to clean up when the Master stops.
  • masterSchedulerBootstrap.start(): Starts the Master scheduling bootstrap service, which initializes scheduling-related services.
  • eventExecuteService.start(): Starts the event execution service, which handles events within workflows, such as task state changes.
  • failoverExecuteThread.start(): Starts the failover execution thread, which is responsible for recovering task execution after the Master crashes.
  • schedulerApi.start(): Starts the scheduling API service, providing scheduling-related interfaces for external calls.
  • taskGroupCoordinator.start(): Starts the task group coordinator, which coordinates the execution of tasks within a task group.

Inspecting the source reveals that the most promising candidate, failoverExecuteThread, does not re-execute missed periodic runs; it only performs fault tolerance for workflows that had started but not finished. Nothing in the source code relates to restoring missed periodic schedules.

Now, a new approach is needed, which is to move from the bottom up:

First, after restarting and recovering, the "run type" shown on the web page is "Scheduled Execution," while the command_type in the database is 6. So there must be a service that inserts commands with command_type = 6 into the database, and it must fetch schedule definitions from the t_ds_schedules table.

Tracing the dolphinscheduler-dao project, which contains all of the database DAOs, we find the ScheduleMapper class, the DAO for the t_ds_schedules table. Tracing t_ds_command leads to the createCommand method in the CommandServiceImpl class. Cross-referencing the two with command_type = 6 points to the executeInternal method of the ProcessScheduleTask class, which satisfies all three conditions: it fetches the schedule, inserts command data, and uses type 6.

Reviewing the source of executeInternal: the first part retrieves the scheduled fire time and the actual fire time from the Quartz context; the second part validates that the cron schedule for this trigger exists and is online.

In executeInternal, the key elements are actually scheduledFireTime and fireTime.

With this information, we can summarize the scheduling principles of DolphinScheduler + Quartz:

  • The web page sets the schedule, which is done through SchedulerController.createSchedule() to create the schedule and insert a record into t_ds_schedules.
  • When the schedule is online, it creates a Trigger in Quartz using QuartzScheduler.insertOrUpdateScheduleTask() and inserts a record into the QRTZ_CRON_TRIGGERS table.
  • Periodically, ProcessScheduleTask.executeInternal() is called to insert data into t_ds_command.
  • Then, it proceeds with the Master-Worker execution flow.
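
To make the Quartz side of this flow concrete, here is a minimal, self-contained sketch using the standard Quartz builder API (hypothetical names; this is not DolphinScheduler's actual code). It registers a job with a 10-second cron trigger, analogous to what insertOrUpdateScheduleTask does, and the job plays the role of ProcessScheduleTask:

import static org.quartz.CronScheduleBuilder.cronSchedule;
import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;

import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class ScheduleRegistrationDemo implements Job {
    @Override
    public void execute(JobExecutionContext context) {
        // In DolphinScheduler, this role is played by ProcessScheduleTask,
        // which inserts a command_type = 6 row into t_ds_command.
        System.out.println("fired at " + context.getFireTime()
                + ", scheduled for " + context.getScheduledFireTime());
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = newJob(ScheduleRegistrationDemo.class)
                .withIdentity("demo-job").build();
        CronTrigger trigger = newTrigger()
                .withIdentity("demo-trigger")
                .withSchedule(cronSchedule("0/10 * * * * ?")) // every 10 seconds
                .build();
        scheduler.scheduleJob(job, trigger);                  // registers the trigger
        scheduler.start();
    }
}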

After understanding the general scheduling flow, combined with scheduledFireTime and fireTime, we can deduce that the scheduling time is not set by DolphinScheduler, but by Quartz.

Next, we look into Quartz and find its Misfire mechanism: a periodic task is due to fire at a scheduled time but, for some reason, does not; that missed firing is called a misfire.

Quartz has a configuration item to determine whether a task is MisFire: org.quartz.jobStore.misfireThreshold, which defaults to 60000ms (i.e., 60 seconds).

Two conditions are required for Misfire to occur:

  1. The job was not executed when it reached the trigger time.
  2. The delay in executing the job exceeds the misfireThreshold configured in Quartz.

If the delay in executing the job is less than the threshold, Quartz will not consider it a Misfire and will execute the job immediately. If the delay is greater than or equal to the threshold, it is considered a Misfire, and Quartz will execute it according to the specified strategy.
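
For illustration, the threshold is set through the standard org.quartz.jobStore.misfireThreshold property. Below is a minimal programmatic sketch; in DolphinScheduler this property would normally go under spring.quartz.properties in application.yaml instead:

import java.util.Properties;

import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.impl.StdSchedulerFactory;

public class MisfireThresholdDemo {
    public static void main(String[] args) throws SchedulerException {
        Properties props = new Properties();
        props.put("org.quartz.scheduler.instanceName", "demo");
        props.put("org.quartz.threadPool.threadCount", "4");
        // A trigger more than 30 seconds late counts as misfired (default: 60000 ms).
        props.put("org.quartz.jobStore.misfireThreshold", "30000");

        Scheduler scheduler = new StdSchedulerFactory(props).getScheduler();
        scheduler.start();
        scheduler.shutdown();
    }
}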

The common reasons for Misfire are:

  • When the job reaches the trigger time, all threads are occupied by other jobs, and there are no available threads.
  • The scheduler was stopped (possibly unexpectedly) at the trigger time. (This is exactly our scenario.)
  • The job is annotated with @DisallowConcurrentExecution, meaning it cannot be executed concurrently, and when the next execution point comes, the previous task has not completed.
  • The job has specified a past start time. For example, if the current time is 8:00:00 AM, and the start time is set to 7:00:00 AM.

Once Quartz confirms a task has misfired, it triggers a compensation mechanism. The available compensation policies are defined in Quartz's Trigger interface:

public interface Trigger extends Serializable, Cloneable, Comparable<Trigger> {
    long serialVersionUID = -3904243490805975570L;
    int MISFIRE_INSTRUCTION_SMART_POLICY = 0;
    int MISFIRE_INSTRUCTION_IGNORE_MISFIRE_POLICY = -1;
    int DEFAULT_PRIORITY = 5;
    ......

Which compensation policy applies depends on the trigger type. The different types of Triggers:

[Image: Quartz trigger types]

The trigger types involved in DolphinScheduler:

[Image: trigger types used in DolphinScheduler]

Types of Trigger:

  • SimpleTrigger is a simple trigger used to execute repeated tasks. It can specify a start time and then repeat the task at fixed intervals until the specified repeat count is reached. The properties of SimpleTrigger include repeat interval (repeatInterval) and repeat count (repeatCount), with the actual number of executions being repeatCount + 1, as the task is executed once at the start time (startTime).

  • CronTrigger: CronTrigger uses a Cron expression to define a complex scheduling plan. A Cron expression consists of 6 or 7 space-separated time fields, representing seconds, minutes, hours, day of month, month, day of week, and an optional year. CronTrigger allows setting up very complex trigger schedules, which cover most capabilities of other triggers.

  • CalendarIntervalTrigger: CalendarIntervalTrigger specifies tasks to be executed at certain time intervals starting from a specific time. Unlike SimpleTrigger, which only supports millisecond time intervals, CalendarIntervalTrigger supports interval units like seconds, minutes, hours, days, months, and years. It is suitable for tasks like executing once a week.

  • DailyTimeIntervalTrigger: DailyTimeIntervalTrigger specifies tasks to be executed at certain time intervals within a specific time each day. It also supports specifying weekdays. It is suitable for tasks like executing every 70 seconds between 9:00 and 18:00, and only on weekdays.
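
As a sketch of how the two trigger types discussed most below are built with Quartz's standard builder API (hypothetical identities and values):

import static org.quartz.CronScheduleBuilder.cronSchedule;
import static org.quartz.SimpleScheduleBuilder.simpleSchedule;
import static org.quartz.TriggerBuilder.newTrigger;

import org.quartz.CronTrigger;
import org.quartz.SimpleTrigger;

public class TriggerDemo {
    public static void main(String[] args) {
        // CronTrigger: fire every 10 seconds, like the reproduced schedule.
        CronTrigger cron = newTrigger()
                .withIdentity("demo-cron")
                .withSchedule(cronSchedule("0/10 * * * * ?"))
                .build();
        // SimpleTrigger: fire every 10 seconds, 5 repeats (6 executions total).
        SimpleTrigger simple = newTrigger()
                .withIdentity("demo-simple")
                .withSchedule(simpleSchedule()
                        .withIntervalInSeconds(10)
                        .withRepeatCount(5))
                .build();
        System.out.println(cron.getCronExpression());
        System.out.println(simple.getRepeatCount());
    }
}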

Therefore, since different types of Triggers have different parameters, when the Trigger triggers the Misfire mechanism, the strategy will also vary based on the Trigger:

/**
 * Common Misfire mechanism in the Trigger class
 **/
 // This is an intelligent strategy, and Quartz will automatically select an appropriate misfire strategy based on the type of Trigger. For CronTrigger, the default is MISFIRE_INSTRUCTION_FIRE_ONCE_NOW.
int MISFIRE_INSTRUCTION_SMART_POLICY = 0;
// This strategy immediately executes all missed trigger events and compensates for all missed actions. Even if the scheduled task's time has ended, it will execute all the tasks that should have been executed all at once.
int MISFIRE_INSTRUCTION_IGNORE_MISFIRE_POLICY = -1;
/**
 * Misfire mechanism for SimpleTrigger, in the SimpleTrigger class
 **/
 // If the trigger misses the scheduled time, this strategy will immediately execute one task and then continue executing subsequent tasks according to the original plan.
int MISFIRE_INSTRUCTION_FIRE_NOW = 1;
// This strategy will set the start time of the trigger to the current time and immediately execute the missed task, including any missed repeat counts.
int MISFIRE_INSTRUCTION_RESCHEDULE_NOW_WITH_EXISTING_REPEAT_COUNT = 2;
// Similar to MISFIRE_INSTRUCTION_RESCHEDULE_NOW_WITH_EXISTING_REPEAT_COUNT, but it will ignore the missed trigger counts and only execute the remaining repeat counts.
int MISFIRE_INSTRUCTION_RESCHEDULE_NOW_WITH_REMAINING_REPEAT_COUNT = 3;
// This strategy will ignore the missed trigger counts and execute the task at the next scheduled time, executing the remaining repeat counts.
int MISFIRE_INSTRUCTION_RESCHEDULE_NEXT_WITH_REMAINING_COUNT = 4;
// Similar to MISFIRE_INSTRUCTION_RESCHEDULE_NEXT_WITH_REMAINING_COUNT, but it will include all the missed repeat counts.
int MISFIRE_INSTRUCTION_RESCHEDULE_NEXT_WITH_EXISTING_COUNT = 5;
/** 
 * Misfire mechanism for CronTrigger, in the CronTrigger class
 **/
 // If the trigger misses the scheduled time, this strategy will immediately execute one task and then continue executing subsequent tasks according to the original plan.
int MISFIRE_INSTRUCTION_FIRE_ONCE_NOW = 1;
// For CronTrigger, this strategy will ignore all missed trigger events and wait directly for the next scheduled trigger time.
int MISFIRE_INSTRUCTION_DO_NOTHING = 2;
......

In QuartzScheduler.insertOrUpdateScheduleTask(), only CronTrigger is used, and the source code is as follows:

CronTrigger cronTrigger = newTrigger()
        .withIdentity(triggerKey)
        .startAt(startDate)
        .endAt(endDate)
        .withSchedule(
                cronSchedule(cronExpression)
                        .withMisfireHandlingInstructionIgnoreMisfires()
                        .inTimeZone(DateUtils.getTimezone(timezoneId)))
        .forJob(jobDetail).build();

Further down:

public CronScheduleBuilder withMisfireHandlingInstructionIgnoreMisfires() {
    this.misfireInstruction = -1;
    return this;
}

Its compensation policy uses the -1 code, meaning every missed trigger event is executed immediately and all compensation actions are performed. This explains why, after a Master restart, every periodic run missed during the downtime is executed once!

Depending on the trigger type, different misfire-handling policies can be selected:

/**
 * For CronTrigger, refer to settings in CronScheduleBuilder
 **/
public CronScheduleBuilder withMisfireHandlingInstructionIgnoreMisfires() {
    this.misfireInstruction = -1;
    return this;
}
public CronScheduleBuilder withMisfireHandlingInstructionDoNothing() {
    this.misfireInstruction = 2;
    return this;
}
public CronScheduleBuilder withMisfireHandlingInstructionFireAndProceed() {
    this.misfireInstruction = 1;
    return this;
}
/**
 * For SimpleTrigger, refer to settings in SimpleScheduleBuilder
**/
public SimpleScheduleBuilder withMisfireHandlingInstructionIgnoreMisfires() {
    this.misfireInstruction = -1;
    return this;
}
public SimpleScheduleBuilder withMisfireHandlingInstructionFireNow() {
    this.misfireInstruction = 1;
    return this;
}
public SimpleScheduleBuilder withMisfireHandlingInstructionNextWithExistingCount() {
    this.misfireInstruction = 5;
    return this;
}
public SimpleScheduleBuilder withMisfireHandlingInstructionNextWithRemainingCount() {
    this.misfireInstruction = 4;
    return this;
}
public SimpleScheduleBuilder withMisfireHandlingInstructionNowWithExistingCount() {
    this.misfireInstruction = 2;
    return this;
}
public SimpleScheduleBuilder withMisfireHandlingInstructionNowWithRemainingCount() {
    this.misfireInstruction = 3;
    return this;
}

Exploring Solutions to the Quartz Misfire Problem

Set the task to "Serial Wait": feasible, but it cannot fully leverage the parallelism of a big-data cluster. It also has a fatal flaw: tasks in serial wait cannot be stopped manually from the page; their status must be modified, or their data deleted, in the t_ds_process_instance table.

Master HA: for a single Master, set up a daemon that restarts it automatically after a failure (though sometimes it fails to restart); for multiple Masters, deploy several to achieve HA.

DolphinScheduler Monitoring Alerts: continuously monitor the running status of DolphinScheduler and alert promptly when a role fails (though a failure may happen in the middle of the night and go unnoticed by the operations team).

Set CPU and Memory Usage Thresholds for DolphinScheduler: in the configuration file, the default CPU and memory thresholds are 70%; once a server's usage reaches 70%, DolphinScheduler stops scheduling tasks onto it. The benefit is that server resources are never fully exhausted. The downside is that if the Master's fault-tolerant backlog occupies the resources, it starves the new, normally scheduled tasks. Moreover, some tasks are critical and must run successfully.

Set the Task Count Limits for DolphinScheduler: in the configuration file, the default task count is 100 per Worker and 1000 per Master. In a live environment, however, the number of tasks cannot be controlled precisely, and DolphinScheduler cannot automatically rebalance the allocation.

Delete the Data in the t_ds_command Table Before Restarting After a Failure: verification shows that while the Master is down, nothing is written to t_ds_command; the rows are written within roughly 1–2 seconds after the restart, leaving no practical window to delete them manually.

Modify the Data in the t_ds_process_instance Table: based on the affected time range, update the status of all workflows in t_ds_process_instance within that range to manually end them (but if DolphinScheduler and its metadata database share a server, the resource exhaustion after a restart can make the metadata database unreachable).

The above solutions can mainly be divided into:

  1. Avoid or reduce Master downtime.
  2. Do not run MisFire tasks after Master downtime.

First, "Avoid or Reduce Master Downtime": This is difficult to achieve in a production environment. The assumption in computer programs is that some issues will occur with 100% certainty at a certain point in time, which is why there are various microservice architectures, high availability (HA), multi-active, and disaster recovery mechanisms.

Second, "Do Not Run MisFire Tasks": According to the previous solutions, no solution can fully solve this problem. Therefore, based on the last code source analysis, it is necessary to consider modifying the source code and recompiling the package to solve the issue.

Modify the Source Code

Modify the key source code to:

CronTrigger cronTrigger = newTrigger()
        .withIdentity(triggerKey)
        .startAt(startDate)
        .endAt(endDate)
        .withSchedule(
                cronSchedule(cronExpression)
                        .withMisfireHandlingInstructionDoNothing()
//                        .withMisfireHandlingInstructionIgnoreMisfires()
                        .inTimeZone(DateUtils.getTimezone(timezoneId)))
        .forJob(jobDetail).build();
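
As a quick sanity check of the change, a trigger built this way should now report DO_NOTHING (2) rather than IGNORE_MISFIRE_POLICY (-1) as its misfire instruction. A hypothetical standalone check, not part of the project:

import static org.quartz.CronScheduleBuilder.cronSchedule;
import static org.quartz.TriggerBuilder.newTrigger;

import org.quartz.CronTrigger;

public class MisfirePolicyCheck {
    public static void main(String[] args) {
        CronTrigger trigger = newTrigger()
                .withIdentity("check")
                .withSchedule(cronSchedule("0/10 * * * * ?")
                        .withMisfireHandlingInstructionDoNothing())
                .build();
        // CronTrigger.MISFIRE_INSTRUCTION_DO_NOTHING == 2
        System.out.println(
                trigger.getMisfireInstruction() == CronTrigger.MISFIRE_INSTRUCTION_DO_NOTHING);
    }
}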

Development Environment Verification

Use Java 8 for the verification.

Modify the MySQL connection information in the application.yaml files of Master, Worker, and API:

spring:
  config:
    activate:
      on-profile: mysql
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://IP_ADDRESS:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8&useSSL=false
    username: USERNAME
    password: PASSWORD
  quartz:
    properties:
      org.quartz.jobStore.driverDelegateClass: org.quartz.impl.jdbcjobstore.StdJDBCDelegate

Modify the Zookeeper information in the application.yaml files of Master, Worker, and API:

registry:
  type: zookeeper
  zookeeper:
    namespace: dolphinscheduler_dev
    connect-string: IP_ADDRESS:2181
    retry-policy:
      base-sleep-time: 60ms
      max-sleep: 300ms
      max-retries: 5
    session-timeout: 30s
    connection-timeout: 9s
    block-until-connected: 600ms
    digest: ~

Modify the pom.xml under the BOM so the MySQL connector is available at runtime (note the test scope is commented out):

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>${mysql-connector.version}</version>
<!--                <scope>test</scope>-->
</dependency>

Modify the logback-spring.xml under the API, Master, and Worker to enable runtime logging:

<root level="INFO">
<!--        <if condition="${DOCKER:-false}">-->
<!--            <then>-->
<!--                <appender-ref ref="STDOUT"/>-->
<!--            </then>-->
<!--        </if>-->
    <appender-ref ref="STDOUT"/>
    <appender-ref ref="APILOGFILE"/>
</root>

<root level="INFO">
<!--        <if condition="${DOCKER:-false}">-->
<!--            <then>-->
<!--                <appender-ref ref="STDOUT"/>-->
<!--            </then>-->
<!--        </if>-->
    <appender-ref ref="STDOUT"/>
    <appender-ref ref="TASKLOGFILE"/>
    <appender-ref ref="MASTERLOGFILE"/>
</root>

<root level="INFO">
<!--        <if condition="${DOCKER:-false}">-->
<!--            <then>-->
<!--                <appender-ref ref="STDOUT"/>-->
<!--            </then>-->
<!--        </if>-->
    <appender-ref ref="STDOUT"/>
    <appender-ref ref="TASKLOGFILE"/>
    <appender-ref ref="WORKERLOGFILE"/>
</root>

Start Master, Worker, and API:

Master VM Options: -Dlogging.config=classpath:logback-spring.xml -Ddruid.mysql.usePingMethod=false -Dspring.profiles.active=mysql

Worker VM Options: -Dlogging.config=classpath:logback-spring.xml -Ddruid.mysql.usePingMethod=false -Dspring.profiles.active=mysql

API VM Options: -Dlogging.config=classpath:logback-spring.xml -Dspring.profiles.active=api,mysql


If this error occurs:

Error running 'ApiApplicationServer'
Error running ApiApplicationServer.
The command line is too long. Shorten the command line via JAR manifest or via a classpath file and rerun.

Then, in the IDE's run configuration, set the "Shorten command line" option (e.g., to "JAR manifest" or "classpath file"), as the error message suggests.

If the error persists due to a missing MySQL JDBC driver, add the following in the pom.xml under Master, Worker, and API:

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.33</version>
</dependency>

Full Compilation and Packaging

Note that the package built here must come from the source tree containing only the change from the "Modify the Source Code" step; revert the tweaks made for "Development Environment Verification" first!

Use Java 8 for the packaging.

Execute the following command in the root directory of the project (packaging may take a while):

mvn spotless:apply clean package -Dmaven.test.skip=true -Prelease

The packaged binary files will be located in dolphinscheduler-dist/target with the .tar.gz extension.

Then, you can try redeploying to verify if the previous issues are resolved.

Compile Only a Single Module

Navigate to the dolphinscheduler-scheduler-quartz directory and execute:

mvn spotless:apply clean package -Dmaven.test.skip=true -Prelease

The packaged file will be in the dolphinscheduler-scheduler-quartz/target directory.


Replace it on the server:

su - dolphinscheduler
mv /opt/module/dolphinscheduler-3.2.1/master-server/libs/dolphinscheduler-scheduler-quartz-3.2.1.jar /opt/module/dolphinscheduler-3.2.1/master-server/libs/dolphinscheduler-scheduler-quartz-3.2.1.jar.bak
mv /opt/module/dolphinscheduler-3.2.1/api-server/libs/dolphinscheduler-scheduler-quartz-3.2.1.jar /opt/module/dolphinscheduler-3.2.1/api-server/libs/dolphinscheduler-scheduler-quartz-3.2.1.jar.bak

cp dolphinscheduler-scheduler-quartz-3.2.1.jar /opt/module/dolphinscheduler-3.2.1/master-server/libs/dolphinscheduler-scheduler-quartz-3.2.1.jar
cp dolphinscheduler-scheduler-quartz-3.2.1.jar /opt/module/dolphinscheduler-3.2.1/api-server/libs/dolphinscheduler-scheduler-quartz-3.2.1.jar

chown -R dolphinscheduler:dolphinscheduler /opt/module/dolphinscheduler-3.2.1/

Then, you can try restarting DolphinScheduler and verify if the previous issue is resolved.

Issue Resolved

Rerunning the reproduction steps confirms that the issue is resolved.
