Chen Debra

Posted on Jun 5

Inside DolphinScheduler’s May 2026 Release: Better Failover, Stronger Security, and More Reliable Plugins

#apachedolphinscheduler #opensource #datascience #data

The May 2026 DolphinScheduler community update can be summarized with two keywords: stability and precision.

On one hand, major stability risks such as Master failover issues, which can seriously affect production environments when a failure occurs, have been addressed. On the other hand, long-standing usability problems, including API authorization gaps, plugin dependency conflicts, and RemoteShell null pointer exceptions, have been systematically fixed.

This monthly report highlights the key changes merged into the dev branch during May, including their impact on users, whether upgrades should be considered, and how to validate them.

Monthly Statistics

Merged PRs: 50
Contributors: 7
Code Changes: +10,036 / -8,542
Major Modules Involved: API, DAO, Master, Task Plugin, CI/Testing

You will notice that testing-related changes account for a large proportion of the updates. This reflects the community's effort to build a stronger foundation for future iterations. Stable and efficient CI/UT pipelines enable faster feature delivery and more reliable bug fixes.

Who Should Read This?

End users and business teams: Want to know which common issues were fixed and whether production environments will become more stable.
Operations and platform engineers: Care about failover, permissions, logging, and plugin stability.
Developers: Want a quick overview of recent engineering governance efforts, including CI, unit testing, and quality assurance improvements.

The 6 Improvements Users Will Notice Most

1. More Reliable Master Failover

Typical scenario: after a Master node crashes, cluster recovery is slow or failover becomes stuck.

One of May's major fixes addresses failover lock leaks, reducing the likelihood that the scheduler remains unavailable for an extended period after failures.

2. More Rigorous Authorization for Critical APIs

Project-level authorization checks have been added to APIs such as view-gantt, view-variables, and trigger workflow.

This makes the permission model behave the way users expect: anyone without authorization for a project can no longer view or trigger that project's workflows through these APIs.

3. Fewer Null Pointer Exceptions in RemoteShell Tasks

Null pointer exceptions in remote tasks are notoriously difficult to troubleshoot due to distributed logs and complex execution contexts.

This month introduces fixes for RemoteShell-related NPEs, making task failures easier to understand and resolve.

4. Improved Dependency Conflict Management for Task Plugins

Plugins such as AliyunServerlessSpark previously suffered from dependency conflicts that could lead to ClassNotFoundException or other compatibility errors at runtime.

Enhancements to dependency management and exception handling improve overall plugin reliability.

5. Faster and More Reliable CI and Unit Testing

This is not a user-facing feature, but it matters greatly.

More stable CI pipelines catch problems before code is merged, and stronger testing reduces the likelihood of production incidents.

6. More Flexible Region and Endpoint Support for AWS S3 Remote Logs

Users relying on S3-compatible storage services or private endpoints now have greater flexibility when configuring regions and endpoints.

This reduces troubleshooting time for connectivity issues caused by storage configuration differences.

Upgrade and Validation Recommendations

This report is based on PRs merged into the dev branch during May 2026, making it valuable for tracking development trends and performing early validation.

If you are running DolphinScheduler in production, prioritize upgrades based on risk:

Recommended for Immediate Attention

Master failover improvements
Authorization and security-related fixes
Task plugin stability enhancements

Can Be Adopted as Needed

CI and testing optimizations
Documentation and formatting updates
Return-type migration and engineering quality improvements

Since all changes were merged into the dev branch, validation in testing or integration environments is recommended:

git fetch origin dev
git checkout dev
git pull --rebase

Focus regression testing on the following scenarios:

Master restart and failover recovery
Critical API authorization validation
Common task plugins such as RemoteShell and ServerlessSpark

Contributor Acknowledgements

Thanks to all contributors who submitted and merged PRs to Apache DolphinScheduler during May 2026.

Your contributions continue to improve the platform's stability, usability, and ecosystem capabilities.

GitHub Username	Main Contribution	Merged PRs	+Lines	-Lines	Score
@ruanwenjun	Test Cases	40	7367	6506	349.69
@SbloodyS	Test Cases	4	2503	1988	45.83
@hiSandog	Documentation	2	34	7	15.12
@leocook	Debug & Fix	1	34	29	9.15
@includetts	Debug & Fix	1	16	6	9.06
@llphxd	Documentation	1	4	4	9.02
@wcmolin	Test Cases	1	78	2	8.26

In-Depth Analysis of Key Technical Changes

A total of 50 PRs were merged this month.

The primary focus areas include:

Stability
Security and Authorization
Plugin Reliability
CI and Testing Efficiency

To help readers quickly understand the most important developments, the following section analyzes five representative changes in detail.

1. [Fix-18197][Master] Fix master failover lock leak (#18207)

Link: https://github.com/apache/dolphinscheduler/pull/18207
Author: @ruanwenjun
Base/Head: dev ← dev_wenjun_fix18197
Diff Stats: +171 / -10

Background and Challenges

Master failover relies on distributed locks to ensure that failover for a given address is not executed concurrently.

If lock release logic is incorrect, lock nodes may leak, preventing future failover operations and leaving the cluster unable to resume scheduling after failures.

Design and Implementation

The lock acquisition interface was redesigned to return an AutoCloseable handle.

Using try-with-resources guarantees symmetric acquire/release behavior.

Additionally, callers now retain the exact lock path, preventing subtle mistakes such as releasing parent paths.
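
The pattern described above can be illustrated with a minimal sketch. It is not the code from the PR: the class FailoverLockSketch, the in-memory set standing in for the registry, and the /lock/failover/ path prefix are all assumptions made for illustration. The point is only to show how an AutoCloseable handle that remembers its own lock path pairs every acquire with exactly one release.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a registry; not DolphinScheduler's real API.
final class FailoverLockSketch {

    /** Handle that remembers the exact lock path it acquired. */
    static final class LockHandle implements AutoCloseable {
        private final Set<String> registry;
        private final String lockPath;

        LockHandle(Set<String> registry, String lockPath) {
            this.registry = registry;
            this.lockPath = lockPath;
        }

        @Override
        public void close() {
            // Release exactly the path that was acquired, never a parent path.
            registry.remove(lockPath);
        }
    }

    private final Set<String> heldLocks = ConcurrentHashMap.newKeySet();

    /** Acquires the lock for one master address and returns a closeable handle. */
    LockHandle acquire(String masterAddress) {
        // The path layout is an assumption made for this sketch.
        String lockPath = "/lock/failover/" + masterAddress;
        if (!heldLocks.add(lockPath)) {
            throw new IllegalStateException("Lock already held: " + lockPath);
        }
        return new LockHandle(heldLocks, lockPath);
    }

    int heldLockCount() {
        return heldLocks.size();
    }

    /** Runs a failover action; the lock is released even if the action throws. */
    void failover(String masterAddress, Runnable action) {
        try (LockHandle ignored = acquire(masterAddress)) {
            action.run();
        }
    }
}
```

Because the release happens in close(), a failover action that throws still leaves no lock behind, which is the leak scenario the fix targets.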

Suggested Metrics

Simulate failover storms in a three-Master cluster by repeatedly issuing kill -9 and automatic restarts.

Compare:

Failover success rate
Mean Time To Recovery (MTTR)
Failover thread blocking duration

Registry lock node count should also be monitored, as lock leaks accumulate over time.
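
As a rough aid for comparing runs, the first two metrics can be computed from recorded trial results. The sketch below assumes each kill -9 trial is logged as a hypothetical Trial record with a recovered flag and a recovery time in milliseconds; how those values are collected is left to the test harness.

```java
import java.util.List;

// Hypothetical helper for summarizing failover trials; not part of DolphinScheduler.
final class FailoverMetricsSketch {

    /** One kill -9 trial: whether scheduling resumed, and how long it took. */
    static final class Trial {
        final boolean recovered;
        final long recoveryMillis;

        Trial(boolean recovered, long recoveryMillis) {
            this.recovered = recovered;
            this.recoveryMillis = recoveryMillis;
        }
    }

    /** Fraction of trials in which failover completed. */
    static double successRate(List<Trial> trials) {
        if (trials.isEmpty()) {
            return 0.0;
        }
        long recovered = trials.stream().filter(t -> t.recovered).count();
        return (double) recovered / trials.size();
    }

    /** Mean recovery time in milliseconds over successful trials only. */
    static double mttrMillis(List<Trial> trials) {
        return trials.stream()
                .filter(t -> t.recovered)
                .mapToLong(t -> t.recoveryMillis)
                .average()
                .orElse(Double.NaN);
    }
}
```

With two recoveries of 2,000 ms and 4,000 ms plus one stuck failover, this reports a success rate of two thirds and an MTTR of 3,000 ms. Stuck trials are excluded from MTTR here, so the two numbers should always be read together.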

Compatibility and Rollback

The lock acquisition interface signature changed, so any code that calls it directly may need to be updated.

Rollback is straightforward but requires cleanup of leaked lock nodes to prevent continued service disruption.

2. [Fix][API] Add missing project authorization on view-gantt/view-variables and trigger workflow APIs (#18212)

Link: https://github.com/apache/dolphinscheduler/pull/18212
Author: @ruanwenjun
Base/Head: dev ← dev_wenjun_fixCvePermissionCheck
Diff Stats: +321 / -16

Background and Challenges

Workflow APIs without project-level authorization checks can create privilege escalation risks.

In multi-tenant enterprise environments, this becomes a serious security concern.

Design and Implementation

Authorization validation was added to:

view-gantt
view-variables
trigger workflow

Permission checks are enforced consistently through Controller and Service layers.
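
The general shape of such a check can be sketched as follows. This is an illustration, not the project's implementation: ProjectAuthSketch, ForbiddenException, and the user-to-project grant map are hypothetical names and a simplified data model. It shows one shared check that each project-scoped operation calls before doing any work.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical service-layer sketch; names and data model are assumptions.
final class ProjectAuthSketch {

    /** Thrown when a user lacks access to a project. */
    static final class ForbiddenException extends RuntimeException {
        ForbiddenException(String message) {
            super(message);
        }
    }

    // userId -> project codes the user may access.
    private final Map<Integer, Set<Long>> grants;

    ProjectAuthSketch(Map<Integer, Set<Long>> grants) {
        this.grants = grants;
    }

    /** Shared check that every project-scoped operation calls first. */
    void checkProjectAccess(int userId, long projectCode) {
        Set<Long> allowed = grants.getOrDefault(userId, Set.of());
        if (!allowed.contains(projectCode)) {
            throw new ForbiddenException(
                    "User " + userId + " has no access to project " + projectCode);
        }
    }

    /** Stand-in for a view-variables style operation. */
    String viewVariables(int userId, long projectCode, long workflowCode) {
        checkProjectAccess(userId, projectCode);
        return "variables of workflow " + workflowCode;
    }
}
```

Routing every operation through a single check keeps the rule in one place, so a newly added API is less likely to ship without it.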

Suggested Validation

Benchmark authorization overhead before and after implementation.

Security regression tests should include cross-project access attempts.

Best Practices

Enterprise users should enable stricter tenant isolation policies and audit sensitive API operations.

3. [Fix-18201][TaskPlugin] Fix RemoteShell task NullPointerException and… (#18210)

Link: https://github.com/apache/dolphinscheduler/pull/18210
Author: @leocook
Base/Head: dev ← fix-18201-remoteshell-npe
Diff Stats: +34 / -29

Background and Challenges

RemoteShell tasks are commonly used for operations and integration workloads.

Network interruptions, command output handling differences, and SSH channel inconsistencies can easily lead to NPEs and incomplete logs.

Design and Implementation

Input/output stream handling for SSH channels was improved to eliminate null pointer scenarios.

Exception handling paths were also enhanced to preserve root-cause information.
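
A minimal sketch of both ideas, assuming nothing about the plugin's actual classes: the hypothetical RemoteOutputSketch below treats a missing stream as empty output instead of dereferencing it, and wraps read failures while keeping the original exception as the cause.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not the RemoteShell plugin's real code.
final class RemoteOutputSketch {

    /** Reads command output, treating a missing stream as empty output. */
    static String readOutput(InputStream stream) throws IOException {
        if (stream == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    /** Wraps a read failure while keeping the original exception as the cause. */
    static String runAndCollect(String host, InputStream stream) {
        try {
            return readOutput(stream);
        } catch (IOException e) {
            throw new IllegalStateException("Failed to read output from " + host, e);
        }
    }
}
```

Passing the caught exception as the cause is what keeps the root-cause stack trace in the task log rather than replacing it with a generic failure message.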

Suggested Validation

Inject failures such as:

Remote disconnections
Empty output streams
Immediate command termination

Execute 1,000 test runs and compare:

NPE occurrence rates
Log completeness

Risks and Rollback

Changes are isolated to the plugin layer and are relatively easy to revert.

Regression tests should continue covering:

Empty output
Large output
Non-zero exit codes

4. [Fix-18177][Task Plugin] Fix AliyunServerlessSpark plugin dependency conflicts and improve exception handling (#18180)

Link: https://github.com/apache/dolphinscheduler/pull/18180
Author: @includetts
Base/Head: dev ← fix/aliyun-serverless-spark-deps-v2
Diff Stats: +16 / -6

Background and Challenges

Dependency conflicts are classic runtime problems that often manifest as:

NoSuchMethodError
NoSuchFieldError

They are difficult to reproduce because they only occur under specific dependency combinations.

Design and Implementation

Critical dependency versions were corrected, and exception wrapping was improved.

Users can now directly identify conflicting classes and methods from logs.
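
One way to make such errors readable, shown here as a sketch rather than the plugin's actual code, is to catch LinkageError (the common superclass of NoSuchMethodError, NoSuchFieldError, and NoClassDefFoundError) around the plugin call and rethrow with a message that names the plugin and the missing symbol. The class LinkageDiagnosticsSketch and its message wording are assumptions.

```java
// Hypothetical wrapper; not the AliyunServerlessSpark plugin's real code.
final class LinkageDiagnosticsSketch {

    /** Runs plugin code and converts linkage failures into a readable error. */
    static void runWithDiagnostics(String pluginName, Runnable pluginCall) {
        try {
            pluginCall.run();
        } catch (LinkageError e) {
            // NoSuchMethodError, NoSuchFieldError and NoClassDefFoundError
            // are all subclasses of LinkageError.
            throw new IllegalStateException(describe(pluginName, e), e);
        }
    }

    /** Builds a log-friendly message that names the error type and symbol. */
    static String describe(String pluginName, LinkageError e) {
        return "Plugin " + pluginName + " hit a dependency conflict ("
                + e.getClass().getSimpleName() + "): " + e.getMessage()
                + ". Check for duplicate or mismatched library versions on the classpath.";
    }
}
```

The JVM's own message for these errors usually contains the class and member that failed to resolve, so surfacing it next to the plugin name is often enough to locate the conflicting library.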

Suggested Validation

Execute smoke tests under multiple Hadoop and Spark dependency trees.

Measure:

Startup success rate
Exception readability
Time-to-diagnosis

Best Practices

Production environments should consider dependency isolation techniques such as:

Shading
Relocation
Dedicated ClassLoaders

5. [Chore] Unit-Test performance optimize (#18213)

Link: https://github.com/apache/dolphinscheduler/pull/18213
Author: @SbloodyS
Base/Head: dev ← ut_performance_optimize
Diff Stats: +22 / -6

Background and Challenges

Slow, flaky, or frequently skipped tests delay problem detection until production deployment.

Testing infrastructure directly impacts community development speed and software quality.

Design and Implementation

Unit test execution and CI configurations were optimized.

Temporary safeguards were also introduced to maintain CI stability during environmental issues.

Suggested Validation

Compare:

Total CI duration
Number of executed unit tests
Percentage of skipped tests
Flaky test rerun counts

Risks and Rollback

Temporary test disablement should always include a documented recovery plan.

Conditions for re-enabling tests should be tracked through issues and PRs.

Appendix

PR #18204: https://github.com/apache/dolphinscheduler/pull/18204
PR #18208: https://github.com/apache/dolphinscheduler/pull/18208
PR #18206: https://github.com/apache/dolphinscheduler/pull/18206
PR #18207: https://github.com/apache/dolphinscheduler/pull/18207
PR #18205: https://github.com/apache/dolphinscheduler/pull/18205
PR #18213: https://github.com/apache/dolphinscheduler/pull/18213
PR #18209: https://github.com/apache/dolphinscheduler/pull/18209
PR #18180: https://github.com/apache/dolphinscheduler/pull/18180
PR #18212: https://github.com/apache/dolphinscheduler/pull/18212
PR #18210: https://github.com/apache/dolphinscheduler/pull/18210
PR #18214: https://github.com/apache/dolphinscheduler/pull/18214
PR #18221: https://github.com/apache/dolphinscheduler/pull/18221
PR #18218: https://github.com/apache/dolphinscheduler/pull/18218
PR #18225: https://github.com/apache/dolphinscheduler/pull/18225
PR #18227: https://github.com/apache/dolphinscheduler/pull/18227
PR #18241: https://github.com/apache/dolphinscheduler/pull/18241
PR #18240: https://github.com/apache/dolphinscheduler/pull/18240
PR #18226: https://github.com/apache/dolphinscheduler/pull/18226
PR #18228: https://github.com/apache/dolphinscheduler/pull/18228
PR #18229: https://github.com/apache/dolphinscheduler/pull/18229
PR #18232: https://github.com/apache/dolphinscheduler/pull/18232
PR #18223: https://github.com/apache/dolphinscheduler/pull/18223
PR #18230: https://github.com/apache/dolphinscheduler/pull/18230
PR #18234: https://github.com/apache/dolphinscheduler/pull/18234
PR #18242: https://github.com/apache/dolphinscheduler/pull/18242
PR #18236: https://github.com/apache/dolphinscheduler/pull/18236
PR #18233: https://github.com/apache/dolphinscheduler/pull/18233
PR #18245: https://github.com/apache/dolphinscheduler/pull/18245
PR #18250: https://github.com/apache/dolphinscheduler/pull/18250
PR #18251: https://github.com/apache/dolphinscheduler/pull/18251
PR #18252: https://github.com/apache/dolphinscheduler/pull/18252
PR #18257: https://github.com/apache/dolphinscheduler/pull/18257
PR #18270: https://github.com/apache/dolphinscheduler/pull/18270
PR #18271: https://github.com/apache/dolphinscheduler/pull/18271
PR #18258: https://github.com/apache/dolphinscheduler/pull/18258
PR #18253: https://github.com/apache/dolphinscheduler/pull/18253
PR #18260: https://github.com/apache/dolphinscheduler/pull/18260
PR #18259: https://github.com/apache/dolphinscheduler/pull/18259
PR #18256: https://github.com/apache/dolphinscheduler/pull/18256
PR #18263: https://github.com/apache/dolphinscheduler/pull/18263
PR #18262: https://github.com/apache/dolphinscheduler/pull/18262
PR #18261: https://github.com/apache/dolphinscheduler/pull/18261
PR #18254: https://github.com/apache/dolphinscheduler/pull/18254
PR #18279: https://github.com/apache/dolphinscheduler/pull/18279
PR #18284: https://github.com/apache/dolphinscheduler/pull/18284
PR #18288: https://github.com/apache/dolphinscheduler/pull/18288
PR #18268: https://github.com/apache/dolphinscheduler/pull/18268
PR #18296: https://github.com/apache/dolphinscheduler/pull/18296
PR #18300: https://github.com/apache/dolphinscheduler/pull/18300
PR #18301: https://github.com/apache/dolphinscheduler/pull/18301