DEV Community

Chen Debra


Inside DolphinScheduler’s May 2026 Release: Better Failover, Stronger Security, and More Reliable Plugins

The May 2026 DolphinScheduler community update can be summarized with two keywords: stability and precision.

On the stability side, the Master failover lock leak has been fixed. This is the kind of issue that can take a production scheduler offline when a failure occurs. On the precision side, long-standing usability problems have been addressed, including API authorization gaps, plugin dependency conflicts, and RemoteShell null pointer exceptions.

This monthly report highlights the key changes merged into the dev branch during May: how they affect users, whether an upgrade is worth considering, and how to validate them.

Monthly Statistics

  • Merged PRs: 50
  • Contributors: 7
  • Code Changes: +10,036 / -8,542
  • Major Modules Involved: API, DAO, Master, Task Plugin, CI/Testing

You will notice that testing-related changes account for a large proportion of the updates. This reflects the community's effort to build a stronger foundation for future iterations. Stable and efficient CI/UT pipelines enable faster feature delivery and more reliable bug fixes.

Who Should Read This?

  • End users and business teams: Want to know which common issues were fixed and whether production environments will become more stable.
  • Operations and platform engineers: Care about failover, permissions, logging, and plugin stability.
  • Developers: Want a quick overview of recent engineering governance efforts, including CI, unit testing, and quality assurance improvements.

The 6 Improvements Users Will Notice Most

1. More Reliable Master Failover

Typical scenario: after a Master node crashes, cluster recovery is slow or failover becomes stuck.

One of May's major fixes addresses failover lock leaks, reducing the likelihood that the scheduler remains unavailable for an extended period after failures.

Related PR:
https://github.com/apache/dolphinscheduler/pull/18207

2. More Rigorous Authorization for Critical APIs

Project-level authorization checks have been added to APIs such as view-gantt, view-variables, and trigger workflow.

This makes the permission model more intuitive: users without proper authorization should not be able to access these resources.

Related PR:
https://github.com/apache/dolphinscheduler/pull/18212

3. Fewer Null Pointer Exceptions in RemoteShell Tasks

Null pointer exceptions in remote tasks are notoriously difficult to troubleshoot due to distributed logs and complex execution contexts.

This month introduces fixes for RemoteShell-related NPEs, making task failures easier to understand and resolve.

Related PR:
https://github.com/apache/dolphinscheduler/pull/18210

4. Improved Dependency Conflict Management for Task Plugins

Plugins such as AliyunServerlessSpark previously suffered from dependency conflicts that could surface at runtime as ClassNotFoundException or other compatibility errors.

Enhancements to dependency management and exception handling improve overall plugin reliability.

Related PR:
https://github.com/apache/dolphinscheduler/pull/18180

5. Faster and More Reliable CI and Unit Testing

This is not a user-facing feature, but it matters greatly.

More stable CI pipelines catch problems before code is merged, and stronger testing reduces the likelihood of production incidents.

Related PRs:

6. More Flexible Region and Endpoint Support for AWS S3 Remote Logs

Users relying on S3-compatible storage services or private endpoints now have greater flexibility when configuring regions and endpoints.

This reduces troubleshooting time for connectivity issues caused by storage configuration differences.

Related PR:
https://github.com/apache/dolphinscheduler/pull/18268

Upgrade and Validation Recommendations

This report covers PRs merged into the dev branch during May 2026, which makes it useful for tracking development trends and validating changes early.

If you are running DolphinScheduler in production, prioritize upgrades based on risk:

Recommended for Immediate Attention

  • Master failover improvements
  • Authorization and security-related fixes
  • Task plugin stability enhancements

Can Be Adopted as Needed

  • CI and testing optimizations
  • Documentation and formatting updates
  • Return-type migration and engineering quality improvements

Since all changes were merged into the dev branch, validation in testing or integration environments is recommended:

git fetch origin dev
git checkout dev
git pull --rebase

Focus regression testing on the following scenarios:

  • Master restart and failover recovery
  • Critical API authorization validation
  • Common task plugins such as RemoteShell and ServerlessSpark

Contributor Acknowledgements

Thanks to all contributors who submitted and merged PRs to Apache DolphinScheduler during May 2026.

Your contributions continue to improve the platform's stability, usability, and ecosystem capabilities.

GitHub Username | Main Contribution | Merged PRs | +Lines | -Lines | Score
@ruanwenjun     | Test Cases        | 40         | 7367   | 6506   | 349.69
@SbloodyS       | Test Cases        | 4          | 2503   | 1988   | 45.83
@hiSandog       | Documentation     | 2          | 34     | 7      | 15.12
@leocook        | Debug & Fix       | 1          | 34     | 29     | 9.15
@includetts     | Debug & Fix       | 1          | 16     | 6      | 9.06
@llphxd         | Documentation     | 1          | 4      | 4      | 9.02
@wcmolin        | Test Cases        | 1          | 78     | 2      | 8.26

In-Depth Analysis of Key Technical Changes

A total of 50 PRs were merged this month.

The primary focus areas include:

  • Stability
  • Security and Authorization
  • Plugin Reliability
  • CI and Testing Efficiency

To help readers quickly understand the most important developments, the following section analyzes five representative changes in detail.

1. [Fix-18197][Master] Fix master failover lock leak (#18207)

Background and Challenges

Master failover relies on distributed locks to ensure that failover for a given address is not executed concurrently.

If lock release logic is incorrect, lock nodes may leak, preventing future failover operations and leaving the cluster unable to resume scheduling after failures.

Design and Implementation

The lock acquisition interface was redesigned to return an AutoCloseable handle.

Using try-with-resources guarantees symmetric acquire/release behavior.

Additionally, callers now retain the exact lock path, preventing subtle mistakes such as releasing parent paths.
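The pattern described above can be illustrated with a minimal sketch. This is not the code from the PR: the class names, the in-memory registry, and the lock path layout are all hypothetical stand-ins chosen for illustration. It only shows how an AutoCloseable handle that remembers its exact lock path, combined with try-with-resources, releases the lock on every exit path.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names throughout; this is NOT DolphinScheduler's registry API.
class FailoverLockSketch {

    /** In-memory stand-in for a registry that stores lock nodes. */
    static final class InMemoryRegistry {
        private final Set<String> lockNodes = ConcurrentHashMap.newKeySet();

        /** Returns a handle bound to this exact path, or null if the lock is already held. */
        LockHandle tryAcquire(String lockPath) {
            return lockNodes.add(lockPath) ? new LockHandle(this, lockPath) : null;
        }

        int lockNodeCount() {
            return lockNodes.size();
        }
    }

    /** The handle keeps the exact path it acquired, so close() cannot release a parent path. */
    static final class LockHandle implements AutoCloseable {
        private final InMemoryRegistry registry;
        private final String lockPath;

        LockHandle(InMemoryRegistry registry, String lockPath) {
            this.registry = registry;
            this.lockPath = lockPath;
        }

        @Override
        public void close() {
            registry.lockNodes.remove(lockPath);
        }
    }

    /** Assumed path layout, for illustration only. */
    static String lockPathFor(String masterAddress) {
        return "/lock/failover/master/" + masterAddress;
    }

    /** Runs the failover work under the lock; returns false if another node holds it. */
    static boolean failover(InMemoryRegistry registry, String masterAddress, Runnable work) {
        // A null resource is allowed in try-with-resources; close() is simply skipped.
        try (LockHandle handle = registry.tryAcquire(lockPathFor(masterAddress))) {
            if (handle == null) {
                return false;
            }
            work.run();
            return true;
        } // the lock is released here, even if work.run() threw
    }

    public static void main(String[] args) {
        InMemoryRegistry registry = new InMemoryRegistry();
        try {
            failover(registry, "10.0.0.1:5678", () -> {
                throw new IllegalStateException("simulated failover error");
            });
        } catch (IllegalStateException expected) {
            System.out.println("lock nodes after failure: " + registry.lockNodeCount());
        }
    }
}
```

Because release lives in close(), a caller cannot forget it on an exception path, which is the class of mistake that produces leaked lock nodes.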

Suggested Metrics

Simulate failover storms in a three-Master cluster by repeatedly killing Master processes with kill -9 and letting them restart automatically.

Compare:

  • Failover success rate
  • Mean Time To Recovery (MTTR)
  • Failover thread blocking duration

Registry lock node count should also be monitored, as lock leaks accumulate over time.
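The success rate and MTTR listed above can be computed from recorded trials with a few lines of code. The sketch below assumes each kill-and-restart cycle is recorded as a hypothetical Trial with a recovered flag and a recovery duration; how those values are collected from a real cluster is outside its scope.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for summarizing failover experiments; not part of DolphinScheduler.
class FailoverMetricsSketch {

    /** One kill-and-restart cycle observed during the experiment. */
    static final class Trial {
        final boolean recovered;
        final Duration timeToRecovery; // only meaningful when recovered is true

        Trial(boolean recovered, Duration timeToRecovery) {
            this.recovered = recovered;
            this.timeToRecovery = timeToRecovery;
        }
    }

    /** Fraction of trials in which scheduling resumed. */
    static double successRate(List<Trial> trials) {
        if (trials.isEmpty()) {
            return 0.0;
        }
        long recovered = trials.stream().filter(t -> t.recovered).count();
        return (double) recovered / trials.size();
    }

    /** Mean recovery time over successful trials only (millisecond precision). */
    static Duration meanTimeToRecovery(List<Trial> trials) {
        long totalMillis = 0;
        int recovered = 0;
        for (Trial t : trials) {
            if (t.recovered) {
                totalMillis += t.timeToRecovery.toMillis();
                recovered++;
            }
        }
        return recovered == 0 ? Duration.ZERO : Duration.ofMillis(totalMillis / recovered);
    }

    public static void main(String[] args) {
        List<Trial> trials = Arrays.asList(
                new Trial(true, Duration.ofSeconds(10)),
                new Trial(true, Duration.ofSeconds(20)),
                new Trial(false, Duration.ZERO));
        System.out.println("success rate: " + successRate(trials));
        System.out.println("MTTR: " + meanTimeToRecovery(trials));
    }
}
```

Running the same summary before and after the fix, with identical trial counts, keeps the comparison fair.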

Compatibility and Rollback

Interface signature changes may affect callers.

Rollback is straightforward but requires cleanup of leaked lock nodes to prevent continued service disruption.

2. [Fix][API] Add missing project authorization on view-gantt/view-variables and trigger workflow APIs (#18212)

Background and Challenges

Workflow APIs without project-level authorization checks can create privilege escalation risks.

In multi-tenant enterprise environments, this becomes a serious security concern.

Design and Implementation

Authorization validation was added to:

  • view-gantt
  • view-variables
  • trigger workflow

Permission checks are enforced consistently through Controller and Service layers.
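To make the idea concrete, here is a minimal sketch of a project-level check applied before a read operation. It does not reproduce DolphinScheduler's actual service or permission classes; the class, the exception, and the viewVariables method are hypothetical, and grants are kept in a plain in-memory map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical service; names and storage are assumptions made for illustration.
class ProjectAuthSketch {

    static final class ProjectAccessDeniedException extends RuntimeException {
        ProjectAccessDeniedException(String user, long projectCode) {
            super("User " + user + " has no access to project " + projectCode);
        }
    }

    // user name -> project codes the user may access
    private final Map<String, Set<Long>> projectsByUser = new HashMap<>();

    void grant(String user, long projectCode) {
        projectsByUser.computeIfAbsent(user, u -> new HashSet<>()).add(projectCode);
    }

    /** Single check shared by every endpoint, so no endpoint can skip it by accident. */
    void checkProjectAccess(String user, long projectCode) {
        Set<Long> allowed = projectsByUser.getOrDefault(user, Collections.emptySet());
        if (!allowed.contains(projectCode)) {
            throw new ProjectAccessDeniedException(user, projectCode);
        }
    }

    /** Stand-in for an endpoint such as view-variables: authorize first, then do the work. */
    String viewVariables(String user, long projectCode, long workflowCode) {
        checkProjectAccess(user, projectCode);
        return "variables of workflow " + workflowCode + " in project " + projectCode;
    }

    public static void main(String[] args) {
        ProjectAuthSketch service = new ProjectAuthSketch();
        service.grant("alice", 1001L);
        System.out.println(service.viewVariables("alice", 1001L, 42L));
        try {
            service.viewVariables("alice", 2002L, 42L); // cross-project attempt
        } catch (ProjectAccessDeniedException e) {
            System.out.println("denied: " + e.getMessage());
        }
    }
}
```

The cross-project call in main mirrors the kind of access attempt a security regression test should cover.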

Suggested Validation

Benchmark authorization overhead before and after implementation.

Security regression tests should include cross-project access attempts.

Best Practices

Enterprise users should enable stricter tenant isolation policies and audit sensitive API operations.

3. [Fix-18201][TaskPlugin] Fix RemoteShell task NullPointerException and… (#18210)

Background and Challenges

RemoteShell tasks are commonly used for operations and integration workloads.

Network interruptions, command output handling differences, and SSH channel inconsistencies can easily lead to NPEs and incomplete logs.

Design and Implementation

Input/output stream handling for SSH channels was improved to eliminate null pointer scenarios.

Exception handling paths were also enhanced to preserve root-cause information.
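The two ideas in this section, tolerating a missing stream and keeping the root cause, can be sketched without any SSH library. The example below is an illustration under assumptions, not the plugin's code: readOutput and RemoteTaskException are hypothetical names, and a plain InputStream stands in for the SSH channel's output.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; an InputStream stands in for an SSH channel's output stream.
class RemoteOutputSketch {

    static final class RemoteTaskException extends RuntimeException {
        RemoteTaskException(String message, Throwable cause) {
            super(message, cause); // keep the original exception as the cause
        }
    }

    /** Reads all lines from the stream; a null stream is treated as empty output. */
    static String readOutput(InputStream stream) {
        if (stream == null) {
            return ""; // e.g. the channel closed before any output stream was available
        }
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        } catch (IOException e) {
            // Wrap instead of swallowing, so the log still shows why the read failed.
            throw new RemoteTaskException("Failed to read remote command output", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("null stream -> [" + readOutput(null) + "]");
        InputStream ok = new ByteArrayInputStream("hello\nworld".getBytes(StandardCharsets.UTF_8));
        System.out.print(readOutput(ok));
    }
}
```

Passing the IOException as the cause means the stack trace in the task log still points at the failing read rather than at a later NullPointerException.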

Suggested Validation

Inject failures such as:

  • Remote disconnections
  • Empty output streams
  • Immediate command termination

Execute 1,000 test runs and compare:

  • NPE occurrence rates
  • Log completeness
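A repeated-run comparison like the one above needs only a small harness. The sketch below is one possible shape, with hypothetical names: it runs a task a fixed number of times, counts NullPointerExceptions separately from other failures, and returns the NPE rate. The task itself, including whatever fault injection it performs, is supplied by the caller.

```java
import java.util.function.IntConsumer;

// Hypothetical test harness; the task under test is supplied by the caller.
class FaultInjectionHarnessSketch {

    /** Runs the task `runs` times and returns the fraction of runs that ended in an NPE. */
    static double npeRate(int runs, IntConsumer task) {
        if (runs <= 0) {
            return 0.0;
        }
        int npes = 0;
        for (int i = 0; i < runs; i++) {
            try {
                task.accept(i); // the run index lets the task vary the injected fault
            } catch (NullPointerException e) {
                npes++;
            } catch (RuntimeException e) {
                // Other failures are expected under fault injection and are not counted here.
            }
        }
        return (double) npes / runs;
    }

    public static void main(String[] args) {
        // Simulated task: every tenth run hits a null reference.
        double rate = npeRate(1000, i -> {
            if (i % 10 == 0) {
                throw new NullPointerException("simulated");
            }
        });
        System.out.println("NPE rate: " + rate);
    }
}
```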

Risks and Rollback

Changes are isolated to the plugin layer and are relatively easy to revert.

Regression tests should continue covering:

  • Empty output
  • Large output
  • Non-zero exit codes

4. [Fix-18177][Task Plugin] Fix AliyunServerlessSpark plugin dependency conflicts and improve exception handling (#18180)

Background and Challenges

Dependency conflicts are classic runtime problems that often manifest as:

  • NoSuchMethodError
  • NoSuchFieldError

They are difficult to reproduce because they only occur under specific dependency combinations.

Design and Implementation

Critical dependency versions were corrected and exception wrapping improved.

Users can now directly identify conflicting classes and methods from logs.
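One general way to make such errors readable is to catch LinkageError (the common superclass of NoSuchMethodError and NoSuchFieldError) at the plugin boundary and report where a suspect class was loaded from. The sketch below shows that technique only; it is not the wrapping added in the PR, and runGuarded, locationOf, and PluginDependencyException are hypothetical names.

```java
import java.security.CodeSource;

// Hypothetical wrapper illustrating a general technique, not the plugin's actual code.
class DependencyConflictSketch {

    static final class PluginDependencyException extends RuntimeException {
        PluginDependencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Returns the jar or directory a class was loaded from, if the JVM exposes it. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // JDK classes typically have no code source.
        return source == null ? "JDK runtime (no code source)" : String.valueOf(source.getLocation());
    }

    /** Runs a plugin call and converts linkage failures into a message naming the suspect class. */
    static void runGuarded(Class<?> suspect, Runnable pluginCall) {
        try {
            pluginCall.run();
        } catch (LinkageError e) {
            String message = e.getClass().getSimpleName() + ": " + e.getMessage()
                    + " (suspect class " + suspect.getName()
                    + " loaded from " + locationOf(suspect) + ")";
            throw new PluginDependencyException(message, e);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulates what the JVM throws when a method is missing at runtime.
            runGuarded(String.class, () -> {
                throw new NoSuchMethodError("com.example.Client.submit()V");
            });
        } catch (PluginDependencyException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In a real conflict, the reported location usually reveals which jar version won on the classpath, which shortens time-to-diagnosis.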

Suggested Validation

Execute smoke tests under multiple Hadoop and Spark dependency trees.

Measure:

  • Startup success rate
  • Exception readability
  • Time-to-diagnosis

Best Practices

Production environments should consider dependency isolation techniques such as:

  • Shading
  • Relocation
  • Dedicated ClassLoaders

5. [Chore] Unit-Test performance optimize (#18213)

Background and Challenges

Slow, flaky, or frequently skipped tests delay problem detection until production deployment.

Testing infrastructure directly impacts community development speed and software quality.

Design and Implementation

Unit test execution and CI configurations were optimized.

Temporary safeguards were also introduced to maintain CI stability during environmental issues.

Suggested Validation

Compare:

  • Total CI duration
  • Number of executed unit tests
  • Percentage of skipped tests
  • Flaky test rerun counts

Risks and Rollback

Temporary test disablement should always include a documented recovery plan.

Conditions for re-enabling tests should be tracked through issues and PRs.
