DEV Community: Chen Debra

Upgrading DolphinScheduler Across Major Versions: From 3.1.3 to 3.4.1 via API Automation

Chen Debra — Thu, 21 May 2026 07:22:53 +0000

1. Background: Why Perform a Major Version Upgrade?

Existing Environment

Current DolphinScheduler Version: 3.1.3
Current SeaTunnel Version: 2.1.3

Deployment Scale:
1 Master + 2 Workers, with over 3,700 workflow definitions and more than 20,000 scheduled tasks executed daily.

Years in Production:
The system has been running stably for over three years.

Drivers Behind the Upgrade

Functional Requirements:
With growing business demands, the limitations of the current version became increasingly apparent, including architectural design constraints, metadata database processing bottlenecks, and insufficient server resources.
Community Support:
The official community recommends adopting the latest stable release to obtain better technical support.
Performance Optimization:
Version 3.4.1 delivers significant improvements in scheduling performance and system stability.

Why We Did Not Use the Official Upgrade Path

Large Version Gap:
The upgrade path would require multiple intermediate upgrades:
3.1.3 → 3.2.0 → 3.3.0 → 3.4.1.
Production Environment Constraints:
Multiple maintenance windows were unacceptable due to strict business continuity requirements.
Architectural Change Risks:
Risks included resource center refactoring, metadata schema changes, and compatibility issues.
Workload Considerations:
With thousands of workflows, manually rebuilding tasks would require enormous effort, making automation essential.

2. Overall Migration Strategy: Bypassing the Official Upgrade Path with a “Rebuild + API” Approach

2.1 Core Idea

Instead of pursuing an “in-place incremental upgrade,” we adopted a “new environment + data migration” strategy:

The old 3.1.3 cluster continued running without impacting production workloads.
A brand-new 3.4.1 cluster was deployed to ensure a clean architecture.
Custom scripts were developed to retrieve workflow definitions and task configurations from the old-version APIs.
New-version APIs were used to batch-create workflows in the new cluster.
Business scheduling traffic was gradually switched to the new environment.

2.2 Comparison of Advantages and Risks

Dimension	Official Upgrade Approach	Rebuild + API Approach
Downtime	Multiple upgrades with cumulative downtime potentially lasting hours or even days	Nearly seamless cutover by stopping old schedules and enabling new ones
Rollback Difficulty	Difficult, as the database schema has already changed	Easy, since the old environment remains intact
Data Consistency	Requires validation of all schema migrations	Only core business data (workflow definitions) is migrated; historical execution records are excluded
Version Compatibility	Must handle compatibility issues across all intermediate versions	Directly adapts to 3.4.1 with only necessary parameter transformations
Workload	Requires repeated validation cycles	Effort mainly concentrated on script development
Suitable Scenarios	Minor version upgrades	Major version jumps and large-scale task migration

3. Detailed Implementation Steps

3.1 Environment Preparation

3.1.1 Deploying the New Environment

Deploy a brand-new DolphinScheduler 3.4.1 cluster.
Configure dependencies such as the database and Registry. ZooKeeper was deprecated and replaced with JDBC Registry.
Configure required components such as DataX and SeaTunnel.
Understand key changes in the new version, including:
- SeaTunnel 2.1.3 integration startup mode changed from Spark engine execution to start-seatunnel-spark.sh;
- Default configurations such as tenants, worker groups, and environments are now managed through project preferences;
- Parameter passing behavior changed: downstream tasks must explicitly define IN parameters to receive upstream variable values.
Verify the basic functionality of the new environment and manually validate representative workflows.

3.1.2 API Access Configuration

Configure API access permissions in the new environment by creating new tokens in token management.
Obtain administrator tokens for API calls.
Verify API connectivity.

3.2 Metadata Database Initialization

Replicating Base Tables: Rebuild foundational metadata tables in the new metadata database according to the old-version configuration, including preserving IDs wherever possible. This significantly reduced modification complexity during script-based workflow restoration.

Verification Item	Table
Tenant Table	`t_ds_tenant`
Project Table	`t_ds_project`
User Table	`t_ds_user`
Environment Tables	`t_ds_environment`, `t_ds_environment_worker_group_relation`
Worker Group Table	`t_ds_worker_group`
Data Source Table	`t_ds_datasource`

3.3 Migration Script Development

3.3.1 Preliminary Preparation and Testing

Categorize workflows into template-based and non-template-based tasks.
Select representative workflows and execute them in the new environment to verify successful execution and data synchronization accuracy.

3.3.2 Code Development — Reading Original Workflow Definitions

...
 // Retrieve all workflows and process them iteratively
        String processDefinitionUrl = OLD_URL + "/dolphinscheduler/projects/" + oldProjectCode +
                "/process-definition/query-process-definition-list";
        Map<String, String> map = new HashMap<>();
        map.put("projectCode", oldProjectCode);
        String pdRes = httpClientUtilOld.doGetRequest(processDefinitionUrl, map);
        ArrayList<JSONObject> dataList = parseResDataToList(pdRes);
        for (JSONObject job : dataList) {
            String oldWFCode = job.get("code").toString();
            Map<String, String> mapPara = new HashMap<>();
            String oldurl = OLD_URL + "/dolphinscheduler/projects/" + oldProjectCode
                    + "/process-definition/" + oldWFCode;
            mapPara.put("code", oldWFCode);
            mapPara.put("projectCode", oldProjectCode);

            String res = httpClientUtilOld.doGetRequest(oldurl, mapPara);
            JSONObject jsonObject = JSON.parseObject(res);
            JSONObject data = (JSONObject) jsonObject.get("data");
            JSONObject processDefinition = data.getJSONObject("processDefinition");
            JSONArray processTaskRelationList = data.getJSONArray("processTaskRelationList");
            JSONArray taskDefinitionList = data.getJSONArray("taskDefinitionList");

            // TODO: generate new task codes and replace old ones
            // Populate workflow information and create workflows
            createWF(processDefinition, processTaskRelationList, taskDefinitionList, NEW_IP, newProjectCode);

3.3.3 Creating Workflows via API

 // Generate task codes based on task count
        int taskCnt = taskDefinitionList.size();
        List<String> taskCodeList = taskDefinitionList.stream()
                .map(obj -> (JSONObject) obj)
                .map(obj -> obj.getString("code"))
                .collect(Collectors.toList());

        try {
// TODO: generate task codes
            String taskCodeUrl = NEW_URL + "/dolphinscheduler/projects/" + newProjectCode + "/task-definition/gen-task-codes";
            HashMap<String, String> taskCodeMap = new HashMap<>();
            // Generate n task codes
            taskCodeMap.put("genNum", String.valueOf(taskCnt));
            String codeData = httpClientUtilNew.doGetRequest(taskCodeUrl, taskCodeMap);
            Object codes = JSON.parseObject(codeData).get("data");
            JSONArray taskCodeArr = JSON.parseArray(codes.toString());

// Add downstream input parameters based on actual task requirements
            for (int i = 0; i < taskDefinitionList.size(); i++) {
                JSONObject logTask = (JSONObject) taskDefinitionList.get(i);
                if (“Condition Logic”)) {
                    JSONObject taskParams = logTask.getJSONObject("taskParams");
                    JSONArray localParams = taskParams.getJSONArray("localParams");

                    JSONObject hiveParam = new JSONObject();
                    hiveParam.put("prop", "hiveAmount");
                    hiveParam.put("direct", "IN");
                    hiveParam.put("type", "VARCHAR");
                    hiveParam.put("value", "");
                    localParams.add(hiveParam);

                    logTask.put("taskParamList", localParams);

                    JSONObject paramMap = new JSONObject();
                    for (Object obj : localParams) {
                        JSONObject param = (JSONObject) obj;
                        paramMap.put(param.getString("prop"), param.getString("value"));
                    }

                    logTask.put("taskParamMap", paramMap);
                    ....


// Replace required parameters such as task code and SeaTunnel execution engine
 for (int i = 0; i < taskCodeList.size(); i++) {
                String oldCode = taskCodeList.get(i);
                String newCode = taskCodeArr.getString(i);

                // Replace task code
                // Replace SeaTunnel engine: "SPARK" -> "start-seatunnel-spark.sh"
                taskDefinitionListJsonStr = taskDefinitionListJsonStr
                        .replace("\"code\":" + oldCode + ",", "\"code\":" + newCode + ",")
                        .replace("\"engine\":\"SPARK\",", "\"startupScript\":\"start-seatunnel-spark.sh\",");

                taskRelationListJsonStr = taskRelationListJsonStr
                        .replace("TaskCode\":" + oldCode + ",", "TaskCode\":" + newCode + ",");

                locationsJsonStr = locationsJsonStr
                        .replace(oldCode, newCode);

...
  }
            Map<String, String> map = new HashMap<>();
            map.put("taskDefinitionJson", taskDefinitionListJsonStr);
            map.put("taskRelationJson", taskRelationListJsonStr);
            map.put("locations", locationsJsonStr);

            map.put("name", processDefinition.getString("name"));
            map.put("tenantCode", "omm");
            map.put("executionType", processDefinition.getString("executionType"));
            map.put("description",
                    processDefinition.getString("description") == null ? "" : processDefinition.getString("description"));
            map.put("globalParams", processDefinition.getString("globalParams"));
            map.put("timeout", processDefinition.getString("timeout"));

            String processDefinitionUrl = NEW_URL + "/dolphinscheduler/projects/" + newProjectCode + "/workflow-definition";
            String processDefinitionRes = httpClientUtilNew.doPostRequest(processDefinitionUrl, map);

3.4 Migration Execution

3.4.1 Migration Procedure

Back Up Old Scheduling Tables:
Example: t_ds_schedules_20260416_10
Select a Pilot Project:
Choose a project with moderate workload and limited business impact.
Migrate Workflow Definitions:
Migrate approximately 200 workflow definitions, including scheduling configurations.
Deploy Workflows Without Enabling Schedules:
Deploy workflows first without activating schedules.
Manual Validation:
Execute workflows manually in batches and verify conflicts with the original cluster. Since most workflows run daily, hourly, or every 15 minutes, conflicts were minimal.
Investigate Failed Tasks:
Analyze root causes, fix issues, and rerun failed workflows.
Enable Scheduling Configurations:
Enable schedules after all workflows pass validation.
Disable Old Cluster Scheduling:
After confirming stable operation in the new environment, disable corresponding schedules in the old cluster.
Migrate Project by Project, Batch by Batch

3.4.2 Migration Execution Results

Following this process, we first selected a representative project containing 199 workflows. After migration, it was tested in production for one week without issues.

Subsequently, we completed migration for 50 projects, totaling approximately 3,700 workflows, within about 10 days.

Tracking Table

No.	Project Name	Project Code	Workflow Count	Progress	Remarks
1	Project 1	13*******	199	Completed on 04/16	One-week stability test completed
2	Project 2	...	...	...	...
...	...	...	...	...	...
50	...	...	...	...	...

3.4.3 Runtime Status

The new cluster has now been running for nearly one month.

Previously, scheduling delays ranged from 10 seconds to over one minute in severe cases. After the upgrade, scheduling latency has been virtually eliminated.

Issues related to missing scheduling instances have also not reoccurred.

So far, the system has been running smoothly without any identified problems, and continuous monitoring remains in place.

4. Risk Control and Contingency Planning

4.1 Major Risks

Risk of Data Loss:
Some configurations could potentially be missed during migration.

Compatibility Issues:
Certain configurations supported in the old version may not be supported in the new version.

Business Interruption Risks:
Scheduling delays could occur during the switchover process.

4.2 Contingency Plans

4.2.1 Rollback Strategy

Immediately stop scheduling in the new environment.
Restore scheduling services in the old environment.
Analyze root causes and retry after issue resolution.

4.2.2 Data Backup

Perform a complete backup of the old environment database.
Back up the initial configuration of the new environment.

5. Conclusion

5.1 Project Outcomes

Successfully completed a cross-version upgrade with zero business interruption.
Automated migration scripts significantly improved efficiency and reduced manual errors.
The new version delivered major performance gains and substantial stability improvements.

5.2 Lessons Learned

Major version upgrades require comprehensive evaluation of architectural changes.
API-based migration is highly suitable for configuration migration, though parameter compatibility must be handled carefully.
Thorough testing and validation are critical to success.
A robust monitoring system is essential for operational stability.
Comprehensive documentation is invaluable for long-term maintenance.

5.3 Future Plans

The current upgrade primarily addressed DolphinScheduler scheduling bottlenecks. To align with upcoming Spark cluster upgrades, the next step will be upgrading SeaTunnel from version 2.1.3 to 2.3.12, most likely using the same migration methodology.
Explore automated testing solutions.
Share migration experience with other teams.

DolphinScheduler Agent Is Now Open-Source! Bringing Self-Healing Automation to DataOps

Chen Debra — Mon, 18 May 2026 03:17:26 +0000

At the 2026 Apache DolphinScheduler Meetup technical session, the DolphinScheduler Agent solution presented by Liu Xiaodong immediately became one of the hottest topics in the community. This end-to-end system, connecting “group alert → intelligent diagnosis → automatic recovery → reporting loop,” effectively solves the fragmentation, high manual overhead, and constant context switching of traditional operations workflows, bringing big data incident handling from the era of “manual firefighting” into the age of “intelligent autonomous operations.”

The project’s core supporting tool, dolphinscheduler-cli (dsctl), has now officially been open-sourced on GitHub and is freely available for all developers!

Watch the Replay: https://youtu.be/mnGC-XOf8xU

The Pain of Traditional Operations: Slow Recovery Isn’t About Commands — It’s About Fragmented Context

When using Apache DolphinScheduler in daily production, handling failed tasks has always been a major burden for operations teams.

The workflow is all too familiar:

A Feishu alert pops up → open the DS UI to check instance status → inspect logs to locate the failure → compare with the Runbook → manually decide what to do → return to the group chat and reply with the result...

What truly slows down efficiency is not executing a command itself, but the constant loss of context across multiple systems. Facts, evidence, and risks are scattered across different tools, forcing operators to spend enormous time “searching for information, stitching logic together, and rebuilding context.” Collaboration breaks frequently, troubleshooting costs soar, and incident recovery cycles become unnecessarily long.

With DolphinScheduler Agent, all of this changes.

A Major Upgrade: From Fragmented Human Coordination to an Intelligent End-to-End Closed Loop

To solve these operational gaps, the goal of the DolphinScheduler Agent solution is crystal clear:

Transform every failure alert into a continuous, traceable, and reusable handling workflow.

The old model treated alerts, UI pages, logs, group chats, and postmortems as isolated systems heavily dependent on human coordination.

The new model starts from a Feishu alert and flows through Channel conversations, intelligent orchestration, execution control, verification, and automated reporting, forming a seamless end-to-end process from trigger to resolution — without requiring engineers to jump repeatedly between systems.

Simply put:

Once an alert is triggered, the Agent automatically takes over.
Once handling is complete, it automatically replies in the group and generates a full incident report.

Operations engineers only need to review the conclusion instead of “running around everywhere.”

Five-Layer Core Architecture: Not Just Scripts, but a Safe and Controllable Intelligent Control Chain

Many people mistakenly think automated operations simply mean “bots + scripts.”

However, DolphinScheduler Agent takes a much more robust and engineering-oriented approach: a five-layer decoupled control chain. Each layer has clear responsibilities, ensuring both execution capability and strict safety boundaries.

1. L1 Event & Collaboration

Alerts directly enter Feishu threads, allowing human intervention and questioning at any time. The workflowInstanceId serves as the unique incident anchor, ensuring information is never lost or fragmented.

2. L2 Session Integration

Feishu events synchronize into local sessions, maintaining full conversational context and eliminating interruptions caused by switching systems.

3. L3 Intelligent Orchestration

Claude Code handles information organization and invocation orchestration, while Skills encapsulate DolphinScheduler domain expertise for more accurate decision-making.

4. L4 Execution Control

dsctl centrally handles the core actions of evidence collection, fault recovery, and result verification, providing standardized, reusable, and stable command execution.

5. L5 Governance & Reporting

The system automatically generates Feishu replies, incident reports, and audit logs, balancing real-time collaboration with long-term governance and postmortem analysis.

This architecture directly addresses real operational requirements:
Only through decoupled architecture can capabilities scale reliably; only through clear boundaries can automation safely enter production environments.

Four Core Modules: Making Self-Healing Truly Production-Ready

Built on top of the five-layer architecture, four tightly integrated modules make the system practical, scalable, and trustworthy.

📌 Channel: Native Feishu Entry Point for Unified Collaboration

Feishu groups become the alert entrance, collaboration interface, and result feedback page all in one.

Agents, humans, and on-call workflows collaborate within the same thread. Group chats display concise conclusions, while detailed evidence is preserved in reports for future reference.

📌 Runtime: Intelligent Orchestration Engine with Decoupled Rules and Execution

Claude Code manages conversation orchestration logic, while Skills encapsulate operational expertise such as fault response, workflow design, and data quality governance.

By separating orchestration, rules, and execution into independent layers, the system becomes highly extensible and continuously evolvable.

📌 Control Plane: dsctl as the Unified Execution Foundation

dsctl is the core execution engine powering the entire Agent system.

It provides standardized CLI capabilities that can be safely invoked by automation systems:

Evidence collection: doctor / digest / log
Fault repair: recover-failed / edit --dry-run
Result verification: watch / digest
Unified outputs: fully observable, traceable, and auditable

With dsctl, manual commands become stable automation primitives.

A Seven-Step Standard Closed Loop: Dual-Path Protection for Production Safety

From alert triggering to incident reporting, the Agent strictly follows a seven-step state machine:

Alert Parsing → Diagnosis → Decision → Execution → Verification → Response → Reporting

Two execution paths guarantee safety:

Happy Path
For low-risk scenarios with sufficient evidence:
collect evidence → generate execution plan → recover failed tasks → verify → reply in the group → generate report
Escalation Path
For insufficient evidence, high-risk situations, or failed verification:
escalate to human operators while preserving complete context — never falsely reporting success.

Everything is traceable, auditable, and reviewable, enabling safe and stable production deployment.

📌 Safety: Four-Level Risk Governance — Safety Comes First

In production automation, safety always matters more than speed.

The system classifies operations into four risk levels:

Automatically Allowed: read-only queries and log viewing
Automatic + Protection: low-risk recovery operations like recover-failed
Human Approval Required: high-risk modifications
Forbidden: dangerous operations such as force-success are directly blocked

This defines the system’s core philosophy:

The true strength of an Agent is not “daring to execute,” but knowing “when not to execute.”

A Pragmatic Roadmap: Gradual Delegation Toward Autonomous Operations

To ensure safe production adoption, the Agent follows a gradual empowerment strategy:

MVP Stage: read-only diagnosis + automated short replies
V1 Stage: enable low-risk automatic recovery via recover-failed
V2 Stage: integrate approval mechanisms for broader controllable operations
V3 Stage: accumulate Runbooks and Skills for community collaboration

The true value of this solution is not a single prompt, but an entire engineering framework built around:

Channel + Skill + CLI + Report + Safety

A reusable and portable operational architecture.

Demo

To help the audience better understand DolphinScheduler Agent’s capabilities, Liu Xiaodong also demonstrated a live demo during the session.

Please refer to the video starting from 57:10 for the full demonstration.

🎉 Official Open Source Release: dsctl Is Now Available on GitHub

The great news is that the core project powering DolphinScheduler Agent — dolphinscheduler-cli (dsctl) — has officially been open sourced!

GitHub Repository:
dolphinscheduler-cli GitHub Repository

The project provides a complete CLI toolkit supporting:

DolphinScheduler configuration and environment management
Workflow authoring, lint checking, and DryRun simulation
Runtime monitoring, instance inspection, and log retrieval
Failure recovery, rerun handling, and batch operations
Standardized outputs fully compatible with automation and Agent integration

The project is released under the Apache-2.0 license, supports one-line installation via pip, and is compatible with mainstream DolphinScheduler versions including 3.3.2, 3.4.0, and 3.4.1.

Final Thoughts

DolphinScheduler Agent is redefining the operational paradigm for big data systems:

Free people from repetitive tasks, fragmented workflows, and endless context switching — let systems handle incidents, while humans focus on decision-making and governance.

From alert triggering to automatic recovery, automated replies, and report generation, the entire process becomes a seamless one-click closed loop.

If everything runs smoothly, operations teams really can “lie back and let the system do the work.”

Developers, operators, and big data engineers are all welcome to explore dsctl on GitHub and join the community in building a simpler, smarter, and more efficient future for operations.

Part 9 | Beyond Scheduling: How Data Platforms Evolve into DataOps Systems

Chen Debra — Fri, 24 Apr 2026 02:20:41 +0000

In the continuous evolution of data platforms, many teams encounter a critical turning point: the scheduling system is already stable, and tasks run on time, yet overall efficiency does not improve. Instead, as the scale grows, the system becomes increasingly difficult to maintain. The root cause is that the platform still operates at the level of “task scheduling” rather than advancing to the level of “engineering governance.”

This article focuses on that transformation—how scheduling evolves from an execution tool into the core platform supporting DataOps, along with the key methodologies and practical approaches involved. It also uses Apache DolphinScheduler as a concrete example to illustrate this transition.

The Evolution of the Scheduler’s Role

At the beginning, scheduling systems were essentially enhanced tools for timed execution. Tasks existed in the form of scripts, triggered by time, with little to no clear dependency relationships between them. This model worked when the number of tasks was small, but as data pipelines became more complex, issues began to emerge: tasks affected each other without visibility, retry strategies were lacking, and pipeline states were difficult to trace.

To address these problems, scheduling systems gradually introduced workflow orchestration mechanisms, organizing tasks into Directed Acyclic Graphs (DAGs), enabling structured representation of data processing flows. For example, a standard ETL process can be clearly connected through dependencies.

At this stage, the key improvement is that scheduling is no longer just a “trigger,” but becomes the “organizer” of data workflows. However, it still remains at the execution layer and does not solve deeper management challenges.

Engineering Transformation Driven by Standards

As the number of tasks continues to grow, teams often realize that the real bottleneck is not scheduling capability, but the disorder of tasks themselves. The same data is repeatedly developed, naming conventions vary across tasks, code reuse is limited, and lineage relationships are difficult to track. At the core, these issues stem from a lack of unified standards.

As a result, the focus of platform development shifts from “enhancing scheduling capabilities” to “establishing engineering standards.” By abstracting a unified development model and standardizing the data processing workflow, maintainability can be significantly improved. For instance, tasks can be uniformly divided into three stages: extract, transform, and load.

Based on this abstraction, individual tasks only need to implement their own logic, avoiding repetitive development.

Once these standards are gradually implemented, tasks are no longer scattered scripts but become structured engineering units, laying the foundation for subsequent governance capabilities.

How Scheduling Platforms Support Engineering Governance

After task standardization is achieved, the role of the scheduling platform undergoes a qualitative transformation. It is no longer just responsible for executing tasks but becomes the control center of the entire data engineering process. By centrally managing task metadata—such as owners, retry strategies, and priorities—the platform enables full lifecycle control over tasks.

At the same time, dependency relationships built through workflows naturally form data lineage, supporting impact analysis and issue diagnosis.

Observability becomes a critical capability at this stage. By continuously monitoring metrics such as execution duration, success rate, and resource consumption, the platform can proactively identify risks. For example, adding simple monitoring logic during execution allows timely alerts when anomalies occur:

def monitor(task):
    if task.duration > threshold:
        alert("task timeout")

    if task.failed:
        send_notification(task.owner)

Furthermore, when the scheduling platform is integrated with code repositories, data development can be incorporated into CI/CD processes, enabling automated validation and deployment. Every change is recorded, and every release is verified, gradually bringing data development in line with software engineering practices.

DataOps Practices with Apache DolphinScheduler

When applying the above concepts to a real system, Apache DolphinScheduler provides a representative implementation path. It is not merely a scheduling tool but has progressively evolved to include key capabilities of a DataOps platform.

First, in terms of task standardization, DolphinScheduler defines a hierarchical structure of “project–workflow–task,” clearly separating development boundaries, resource isolation, and execution units. Each task must specify execution type, resources, retry strategies, and other metadata. This effectively enforces engineering standards rather than allowing arbitrary script integration.

Second, in workflow governance, DolphinScheduler uses visual DAG orchestration to clearly represent complex dependencies. For example, a typical data pipeline can be defined programmatically:

workflow = {
    "name": "user_pipeline",
    "tasks": [
        {"name": "extract", "type": "spark"},
        {"name": "transform", "type": "spark"},
        {"name": "load", "type": "spark"}
    ],
    "dependencies": [
        ("extract", "transform"),
        ("transform", "load")
    ]
}

This structure is not only used for execution but can also support lineage analysis and impact assessment.

Furthermore, in terms of resource governance, DolphinScheduler integrates with underlying resource management systems such as YARN or Kubernetes. Through tenant mechanisms, scheduling maps directly to actual computing resources. This means scheduling is not just about “arranging tasks,” but about controlling resource boundaries and preventing interference between tasks.

In terms of observability, DolphinScheduler provides built-in capabilities such as task logs, execution tracking, and alerting mechanisms, making task execution traceable and auditable. When a node fails, engineers can quickly locate the specific task instance instead of manually searching through logs.

Finally, in engineering capabilities, DolphinScheduler integrates with code management systems to support version control and release management of workflows. Through APIs or automation pipelines, it enables a complete delivery lifecycle from development to testing to production, which is a core aspect of “continuous delivery” in DataOps.

The Evolution Path of Enterprise Data Platforms

From a broader perspective, enterprise data platforms typically evolve through a progressive process. They start with simple script-based and time-triggered systems, then move to workflow-oriented scheduling platforms, further incorporate metadata management and access control, and ultimately evolve into DataOps platforms with automation, observability, and governance capabilities.

The essence of this evolution is the continuous upward shift of focus—from “whether tasks run” to “whether data is reliable,” and finally to “whether engineering is governable.” Each stage reduces complexity while improving controllability and system stability.

A Governable Data Task in Practice

When these concepts are applied in practice, it becomes possible to build data tasks with governance capabilities. Before execution, schema validation can be performed; after execution, runtime metrics can be reported, ensuring full lifecycle control.

At the scheduling layer, task behavior is constrained through unified configurations such as SLA, retry strategies, and alert mechanisms. This approach ensures that tasks no longer depend on individual experience but operate within a standardized governance framework.

Conclusion

The ultimate goal of a scheduling system is never just to “run tasks faster,” but to “make data development manageable.” When a platform can enforce standards, organize workflows, ensure stability through monitoring, and support evolution through automation, it has completed the transformation from scheduling to DataOps.

Scheduling systems represented by Apache DolphinScheduler are evolving from the execution layer to the governance layer—marking the true arrival of the DataOps era.

How A Leading Manufacturing Enterprise in Shenzhen Deploys Apache DolphinScheduler Across Dozens of Factories Within One Day?

Chen Debra — Fri, 17 Apr 2026 07:24:25 +0000

https://youtu.be/OKjCaqQgHoU

As the wave of digital transformation sweeps across the globe, intelligent manufacturing has become the core engine driving high-quality growth in the manufacturing industry.
However, on the path toward intelligence, enterprises are facing a wide range of challenges: data silos across multiple systems, complex scheduling dependencies, and delayed monitoring and alerting issues continue to emerge.

At a recent Apache DolphinScheduler online user meetup, the community invited Qiu Zhongbiao, a senior software engineer from a large intelligent manufacturing enterprise in Shenzhen.

During the session, he delivered a detailed sharing on the practical application of Apache DolphinScheduler in real manufacturing scenarios.

This article organizes the key content from that talk to explore how this enterprise achieved a qualitative leap in its scheduling platform with Apache DolphinScheduler.

About the Author
Qiu Zhongbiao is a senior software engineer at a large intelligent manufacturing enterprise in Shenzhen.

He focuses on data technology research and practice in the field of intelligent manufacturing.

He is dedicated to promoting the digital transformation of the manufacturing industry.

The Era of Intelligent Manufacturing

With the continuous advancement of Industry 4.0, intelligent manufacturing has become the focus of global competition in the manufacturing sector.

The maturity model of intelligent manufacturing is divided into multiple levels from low to high.

Enterprises need to progressively improve their capabilities in automation, digitalization, and networking, and ultimately achieve fully intelligent production.

In this process, data becomes a core production factor.

How to efficiently, stably, and reliably collect, process, and schedule this data has become a critical challenge faced by every manufacturing enterprise.

The data environment in modern manufacturing enterprises is becoming increasingly complex.
On one hand, enterprises operate a large number of business systems, including MES (Manufacturing Execution System), ERP (Enterprise Resource Planning), WMS (Warehouse Management System), WCS (Warehouse Control System), CRM (Customer Relationship Management), QMS (Quality Management System), PLM (Product Lifecycle Management), SCM (Supply Chain Management), and APS (Advanced Planning and Scheduling).

Data exchange between these systems is often implemented through hard-coded integrations.
This leads to highly complex inter-system relationships, high maintenance costs, poor scalability, and difficulty in troubleshooting.

On the other hand, enterprises also face complex network environments.

These include corporate production networks, factory internal networks, and international/domestic dedicated-line networks.

Different network environments have different requirements for data collection, transmission, and scheduling.

How to achieve unified management and task isolation under such conditions becomes a major challenge.

Challenges of Traditional Data Processing Approaches

In the process of promoting data-driven transformation in intelligent manufacturing, enterprises are facing pain points across multiple dimensions.

Category	Details
Data Diversity	1. Protocol complexity: device layer uses proprietary protocols such as PLM/S7, edge layer uses MQTT/COAP, and system layer uses REST/SOAP. 2. Data format heterogeneity: device data includes binary and hexadecimal formats, while database tables are often semi-structured formats such as JSON/XML. 3. Vendor differences: multiple vendors for robots and devices, with significant variations across production lines.
Cross-System / Cross-Factory Collaboration	1. Complex data links: involving devices, gateways, local systems, MES, SAP, ASP, WMS, and remote factories. 2. Mixed network environments: factory intranet, on-site servers, cross-factory dedicated lines, public network, and international network connections. 3. High real-time requirements: production scheduling, capacity planning, and other business functions demand strong timeliness.
Lack of Visualization & Traceability	1. Invisible data pipelines: traditional systems cannot visually display data processing flows. 2. Disconnected logs: data transmission between systems relies on manual logging, making it difficult to store and track complete logs across all nodes. 3. Difficult traceability: tracking data flow across systems requires manual effort and high labor costs.
Unreliable Data Collection Quality	1. Diverse anomalies: network failures, device errors, system exceptions, and duplicate data collection. 2. Delayed issue detection: multiple anomalies are often discovered only after they impact downstream systems, relying on manual intervention. 3. Difficult root cause analysis: multi-system interactions make it hard to locate faults, requiring full-chain understanding of data flows.

First is the foundational barrier caused by data diversity.

Device protocols are highly diverse, covering proprietary protocols such as PLM/S7 as well as general protocols like MQTT.

Data formats include binary data and semi-structured data.

Combined with differences among vendors and production lines, this makes it extremely difficult to standardize data.

On top of that, cross-system and cross-factory data collaboration is particularly challenging.

Data links involve multiple stages, including devices, various systems, and geographically distributed factories.

Network environments are mixed, including intranets, dedicated lines, and the public internet.

At the same time, business scenarios such as production scheduling and capacity calculation have very high requirements for real-time data.

All of these factors further increase the complexity of collaboration.

Meanwhile, data visualization and traceability capabilities are insufficient.
Traditional systems cannot intuitively present data flow nodes.
Logs are stored in a scattered manner, leading to inefficient troubleshooting.
Building a complete traceability system also requires significant manual effort.

Finally, the quality of data collection lacks guarantees.

Various anomalies frequently occur due to networks and devices.

Detection of these anomalies is often delayed.

Manual recovery is inefficient.

In multi-system interactions, fault localization still relies heavily on familiarity with the entire data pipeline.

All of these issues further impact data reliability.

The Apache DolphinScheduler Solution

In response to the above challenges, Apache DolphinScheduler provides a comprehensive solution.
As a distributed, highly extensible, and visual workflow scheduling platform, it demonstrates strong capabilities in manufacturing scenarios.

Worker Node Grouping: A Solution for Complex Network Environments

In terms of Worker node grouping, Apache DolphinScheduler provides a flexible isolation strategy tailored to complex network environments in manufacturing enterprises.

Worker nodes can be grouped by network environments, such as corporate production network Workers, factory internal network Workers, and international/domestic dedicated-line Workers.
They can also be grouped by business types, such as PLC device data collection, production data processing, and quality data analysis.

This enables task isolation across different network environments and business scenarios.
It ensures the security and reliability of data collection.

This solution effectively supports key application scenarios such as production data lake ingestion, customer data feedback, and cross-network data synchronization.

Data Collection

In terms of data collection, Apache DolphinScheduler builds a complete data processing pipeline.

The data source layer includes IoT devices, such as device sensors, heartbeat data, status monitoring, and device operation data.
It also includes business systems such as MES, WMS, ASP, and SAP databases.
In addition, it includes AGENT probes and user-uploaded data.

The processing layer uses DataX for offline data synchronization.
It uses Flink for real-time stream processing.
Kafka is used as a message queue buffer.

Finally, data is unified into a data lake.
This supports BI analysis and AI applications.

Through unified scheduling with Apache DolphinScheduler, enterprises can achieve end-to-end management from data collection to processing to application.

Data Interaction

In the traditional model, systems interact with each other in a point-to-point manner.

This leads to highly complex relationships between systems.

After introducing Apache DolphinScheduler, all data interactions are unified through the scheduling center.

This enables centralized management of all data interaction tasks.
It allows visual monitoring of task execution status.
It provides unified exception handling and alerting mechanisms.

At the same time, it reduces coupling between systems.
It improves the reliability of data interactions.

Template-Based Data Collection and Distribution Across Multiple Factories

For manufacturing enterprises with multiple factories, Apache DolphinScheduler provides a template-based solution.

For homogeneous systems, such as unified MES/WMS systems or the same types of PLC devices, how can rapid deployment be achieved?

The approach is to solidify core processes into reusable templates.
These processes include reading task lists, parameter injection, execution of data collection or distribution, and completion or exception marking.

At the same time, task configuration tables are introduced.
These include data source configurations, SQL statements, system IDs for distribution or collection, custom parameters, and checkpoint settings.

This enables a flexible model of “template standardization + parameter customization.”

This template-based solution brings several significant advantages.
First, parameterized configuration allows the core process to be standardized as a template, while factory-specific parameters such as IP addresses, accounts, and paths are configured separately.
Second, batch deployment capability allows enterprises to complete deployment across dozens of factories within one day, greatly improving efficiency.
Third, a unified iteration mechanism ensures that when templates are updated, all factories are automatically synchronized without the need for manual adjustments.
Fourth, flexible extensibility supports template version management, allowing customized templates to be derived for different factories based on a base template.
For example, some factories may require additional data fields.
Fifth, cross-scenario support enables both “multi-factory data collection to headquarters” and “headquarters data distribution to multiple factories,” such as unified production plan distribution.

A Qualitative Leap: From Manual Workshop to Industrial Pipeline

After introducing Apache DolphinScheduler, the enterprise achieved a qualitative leap in data processing.

Dimension	Traditional Coding	Apache DolphinScheduler
Development Efficiency	Requires writing data processing logic, exception handling, retry logic, etc.; high human effort	Drag-and-drop configuration, built-in components and plugins, development completed in stages
Dependency Management	Difficult to handle complex task dependencies; prone to issues such as missing or inconsistent dependencies	Visual DAG-based workflow orchestration
Monitoring & Alerting	Requires custom development of monitoring or logging, leading to lagging issue detection	Built-in monitoring, real-time task execution status, logs, and alert notifications
Fault Tolerance & Retry	Requires manual modification of code/scripts; complex recovery process	One-click retry/stop; built-in fault-tolerant retry mechanisms
Resource Scheduling	Lacks unified management; prone to CPU/memory contention and uneven resource allocation	Distributed, centralized resource management; dynamic scaling via integration with compute engines

In the traditional approach, developers needed to write code for data connections, exception handling, and retry logic modules.
This required significant human effort.

In contrast, Apache DolphinScheduler uses a drag-and-drop configuration approach.
It comes with numerous built-in plugins.
Development tasks can be completed within minutes.

In terms of dependency management, traditional approaches struggle to handle complex cross-system scheduling.

Issues such as idempotency and consistency must be considered.
This makes the process error-prone.

In contrast, Apache DolphinScheduler provides intuitive and convenient visual DAG operations.

The improvement in monitoring and alerting capabilities is particularly significant.

Traditional approaches require developers to write monitoring scripts or manually check logs.

This leads to delayed fault detection and resolution.

Apache DolphinScheduler comes with built-in monitoring capabilities.

It supports real-time viewing of task execution status and logs.

It can also integrate with multiple alerting channels such as WeCom, DingTalk, and email.

In terms of fault tolerance and recovery, traditional approaches require manual modification of code and scripts.

Data recovery logic is complex.

Apache DolphinScheduler provides one-click rerun and stop functions.

It also includes built-in automatic retry mechanisms for failures.

Resource scheduling capabilities are also greatly improved.

Traditional approaches lack unified resource management.

This often leads to CPU and memory overload on single machines, causing crashes.
Distributed approaches also consume significant resources.

Apache DolphinScheduler adopts a distributed and decentralized cluster management architecture.

It supports rapid dynamic scaling through monitoring.
It enables fine-grained resource management.

These improvements bring real value at multiple levels.

Development	Business	Decision Layer
1. Drag-and-drop development	1. Visualized monitoring	1. De-personalization (processes not dependent on individuals)
2. Automated parameterization	2. Alert assurance	2. Operation auditing
3. Log-based issue localization	3. Flexible parameters	3. Data security (centralized data configuration)
4. Low O&M cost	4. Cross-system orchestration	4. Elimination of black-box operations
—	5. Reduced development dependencies	5. Resource utilization & measurability

At the development level, drag-and-drop workflows lower the technical barrier.

Parameter automation improves development efficiency.

Second-level log tracing shortens troubleshooting time.

Operational costs are significantly reduced.

At the business level, visual monitoring provides a clear view of task status.

Multi-channel alerting ensures timely response to issues.

Flexible data recovery strategies handle various anomalies.

Cross-system coordination enables unified data flow management.

Dependence on individual developers is reduced.

At the decision-making level, knowledge is no longer tied to individuals.

It becomes an organizational asset.

Complete audit logs meet compliance requirements.

Centralized database configuration reduces security risks.

Transparent workflows make management and optimization easier.

Quantified resource usage supports refined decision-making.

These values together form a solid foundation for enterprise digital transformation.

Results and Future Outlook

Through the practical application of Apache DolphinScheduler, this intelligent manufacturing enterprise has achieved significant improvements across multiple dimensions.

These include improved development efficiency, shortened deployment cycles, significantly reduced operational costs and manpower, and greatly increased task success rates.

At the same time, the system supports rapid scaling.

New factories can be deployed within one day.

This enables standardized processes, transparent management, and data-driven decision-making.

Looking ahead, as intelligent manufacturing continues to advance, data scheduling will play an increasingly important role.

As an open-source project, Apache DolphinScheduler will continue to evolve in multiple directions.

In terms of AI enablement, it will introduce AI capabilities to achieve intelligent scheduling and predictive maintenance.

In terms of cloud-native architecture, it will deeply adapt to cloud-native environments to improve elasticity and scalability.

In terms of ecosystem expansion, it will enrich the plugin ecosystem to cover more business scenarios.

Conclusion

In the journey of intelligent manufacturing, data scheduling is not the destination, but the starting point.

Apache DolphinScheduler helps enterprises solve the “last mile” problem of data processing.
It allows enterprises to focus more on business innovation and value creation.

The road to digital transformation is long and challenging.
But with persistence, progress will be made.

May more manufacturing enterprises leverage the power of open source to achieve a transformation from “manufacturing” to “intelligent manufacturing.”

Part 8 | Boundaries, Collaboration, and Best Practices Between Apache DolphinScheduler and Flink & Spark

Chen Debra — Fri, 17 Apr 2026 07:09:24 +0000

In the continuous evolution of data platforms, a very common yet subtle misconception is that teams unconsciously allow the scheduling system to take on more and more responsibilities that do not belong to it, such as writing complex business logic in the scheduling layer, controlling computation parameters, and even attempting to centrally manage execution details across different computing engines.

In the short term, this may seem to improve efficiency, but in the long run, such a design often makes the system highly coupled, difficult to maintain, and even causes it to lose stability as scale increases.

Therefore, before discussing specific practices, we must first clarify one thing: the boundary between the scheduling system and data engines.

Responsibilities and Boundaries Between the Scheduler and Data Engines

To understand how the entire system operates, it is helpful to remember a very core principle: the scheduling system is only responsible for “when to run” and “dependency relationships,” while “how to compute” must be left to execution engines such as Spark, Flink, or SeaTunnel.

In other words, DolphinScheduler is the orchestrator of workflows, not the executor of computation.

From an engineering perspective, this division of responsibilities can be clearly expressed in the following table:

Component	Core Responsibility
DolphinScheduler	DAG orchestration, task scheduling, dependency management, failure retry
Spark	Offline batch processing
Flink	Real-time stream processing
SeaTunnel	Data integration (batch / streaming / CDC)

In actual development, the place where this boundary is most easily broken is often the Shell task.

Many people are accustomed to writing complex branching logic in a single node, for example, deciding which Spark job to execute based on the date:

if [ "$day" == "2026-04-01" ]; then
  spark-submit job_a.py
else
  spark-submit job_b.py
fi

Although this approach “works,” it brings three problems: first, the logic is hidden inside the script and cannot be perceived by the DAG; second, dependency relationships are no longer clear, affecting the visualization capability of the scheduling system; third, the cost of maintenance and troubleshooting will increase significantly in the later stages.

A more reasonable approach is to explicitly model the branching logic in the workflow and control the execution path through conditional nodes, so that the entire process is visible and controllable in the UI.

Differences in Scheduling Between Batch, Streaming, and CDC

After the boundaries are clear, when we look at the scheduling methods of different types of tasks, we will find that they are essentially three completely different models, rather than simple variations of the same scheduling logic.

First is batch processing, which is the type of scenario that best fits the traditional scheduling model, such as T+1 tasks in a data warehouse or aggregation computations running hourly.

Such tasks have clear time windows and well-defined upstream and downstream dependencies, making them very suitable to be expressed through DAGs.

In practice, they are usually split into layers such as ODS, DWD, and DWS, with each layer corresponding to one or more independent tasks, and driven by parameters (such as ${biz_date}).

For example, a typical Spark submission method is as follows:

spark-submit \
  --class com.example.ETLJob \
  --master yarn \
  --deploy-mode cluster \
  etl-job.jar \
  --date ${biz_date}

In this process, the responsibility of the scheduling system is to connect task relationships, control execution order, and handle failure retries, rather than diving into the specific computation logic.

In contrast to batch processing, streaming tasks are fundamentally “continuously running,” rather than “periodically triggered.”

If a scheduling system is used to start a Flink job every few minutes, it is essentially solving the problem in the wrong way.

A well-designed streaming task should rely on Flink’s own state management and checkpoint mechanism to run continuously, while DolphinScheduler plays more of a “guardian” role, responsible for initial startup, status detection, and exception recovery, rather than frequent intervention.

Looking further at CDC scenarios, it is essentially also a type of streaming processing, but more oriented toward data integration, which is exactly a typical application scenario of SeaTunnel.

Through SeaTunnel, it is very convenient to implement real-time synchronization from databases to message queues, for example, from MySQL to Kafka:

env {
  execution.parallelism = 2
}

source {
  MySQL-CDC {
    hostname = "localhost"
    port = 3306
    username = "root"
    password = "123456"
    database-names = ["test_db"]
    table-names = ["test_db.user"]
  }
}

sink {
  Kafka {
    topic = "user_cdc"
    bootstrap.servers = "localhost:9092"
  }
}

The corresponding startup command is as follows:

./bin/seatunnel.sh \
  --config config/mysql_cdc.conf \
  -e local

At the scheduling level, the principle of CDC is consistent with streaming processing: start once, run continuously, and ensure stability through status detection mechanisms, rather than repeatedly triggering through periodic scheduling.

From this perspective, the core difference between batch processing, streaming processing, and CDC actually lies in whether it needs to be repeatedly scheduled.

Why the Scheduling System Should Not Intrude into the Execution Engine

As the system gradually scales, a deeper question will emerge: why do we repeatedly emphasize that the scheduling system should remain “restrained”?

The reason is that once the scheduling system begins to intrude into the responsibility scope of the execution engine, the controllability of the entire architecture will rapidly decline.

For example, directly writing Spark resource parameters in the scheduling script:

spark-submit \
  --executor-memory 8G \
  --conf spark.sql.shuffle.partitions=500 \
  job.sql

The problem with this approach is that it hardcodes execution-layer configurations into the scheduling layer, making parameter management scattered and difficult to unify.

Once resource configurations need to be adjusted, the scheduling task must be modified, or even the workflow must be redeployed.

A more reasonable approach is to place these parameters in the Spark configuration center or manage them within the job itself, allowing DolphinScheduler to only be responsible for triggering execution:

spark-submit job.sql

This decoupling approach can significantly improve system maintainability, allowing each layer to focus on its own responsibilities.

From an overall architectural perspective, a mature data platform can usually be abstracted into a three-layer structure: the top layer is the scheduling layer represented by DolphinScheduler, responsible for workflow orchestration; the middle layer is the execution layer represented by Spark, Flink, and SeaTunnel, responsible for specific computation and data processing; and the bottom layer is the resource layer such as YARN or Kubernetes, responsible for resource allocation and isolation.

Only when the boundaries of these three layers are clear can the system maintain stability as complexity increases.

A Practical Architecture Example Integrating SeaTunnel

In real production environments, this layered thinking is usually reflected in complete data pipelines.

For example, SeaTunnel can be used to implement CDC from MySQL to Kafka to synchronize real-time data; then Flink performs real-time computation to produce online metrics; at the same time, the data is landed into storage systems, and then Spark completes offline data warehouse processing.

In this process, DolphinScheduler is responsible for unified orchestration of these tasks, including starting CDC, monitoring streaming tasks, and scheduling offline computations.

From a process perspective, it can be abstracted into a clear data link: data enters from the source, goes through SeaTunnel into the real-time channel, is processed by Flink to serve online systems, is simultaneously written into storage, and then processed by Spark for layered transformation, while DolphinScheduler always acts as the “central hub,” coordinating execution order and dependency relationships across all stages.

Summary: Let the System Return to “Each Doing Its Own Job”

Returning to the original question, the design principle of the entire system can actually be summarized in one sentence: DolphinScheduler is the “brain,” while Spark, Flink, and SeaTunnel are the “muscles.”

The scheduling system is responsible for decision-making and orchestration, while the execution engines are responsible for specific computation and processing.

In practical implementation, it can be further summarized into three simple but very critical principles: first, all process logic must be reflected in the DAG, rather than hidden in scripts; second, all computation logic must be pushed down into the execution engines to avoid expansion of the scheduling layer; third, streaming processing and CDC tasks must be designed based on “long-running” operation, rather than being scheduled repeatedly in a batch-processing manner.

When these three points are strictly followed, the data platform can evolve from “just able to run” to “stable, scalable, and governable,” which is also a key step from engineering to systematic architecture.

Part 7 | Where Scheduling Systems Really Break and the Hidden Bottlenecks Beyond CPU and Scale

Chen Debra — Fri, 10 Apr 2026 09:58:17 +0000

In production environments, performance issues in a scheduling platform are never caused by a single bottleneck. Instead, they arise from the combined effects of scheduling decisions, task execution, metadata storage, and coordination mechanisms. Taking Apache DolphinScheduler as an example, focusing on just one component, such as the Master or Worker, often leads to misidentifying the root cause.

This article is based on real-world production experience. It systematically breaks down performance bottlenecks in a scheduling platform and provides practical, actionable optimization strategies.

1. From the overall architecture, where exactly are the bottlenecks?

The core workflow of DolphinScheduler can be abstracted as:

Scheduling → Execution → Storage → Coordination

Any layer can become a bottleneck, but the most common issues are concentrated in four areas:

Insufficient scheduling throughput on the Master
Mismatch between Worker execution capacity and workload
Excessive pressure on the database (MySQL/PostgreSQL)
Latency or instability in ZooKeeper (coordination layer)

2. The Master bottleneck is not CPU, but the “scheduling model”

Many assume the Master’s CPU is the issue. In practice, the real bottleneck is the combination of the scheduling model and database I/O.

2.1 Scheduling mechanism

The Master’s core loop looks like this:

// MasterSchedulerService.java
while (true) {
    List<ProcessInstance> instances = processService.findNeedScheduleProcessInstances();

    for (ProcessInstance instance : instances) {
        submitProcessInstance(instance);
    }
}

This is a polling + database-driven model. The key limitation is that scheduling capacity is directly tied to database throughput.

2.2 Typical symptoms

High scheduling latency:

Tasks are ready but delayed by tens of seconds before execution, while Master CPU usage remains low and database QPS is high.

Low throughput:

The system may only schedule a few hundred tasks per minute, and adding more Masters yields limited improvement.

2.3 Optimization strategies

Reduce database scanning pressure

Typical SQL:

SELECT * FROM t_ds_process_instance
WHERE state = 'READY'
LIMIT 100;

Optimization:

CREATE INDEX idx_state_priority_time 
ON t_ds_process_instance(state, priority, create_time);

Additional measures include limiting scan batch sizes and tuning scheduling intervals to avoid excessive polling.

Increase scheduling concurrency

Key configuration:

master:
  exec-threads: 100
  dispatch-task-number: 50

Practical guidelines:

exec-threads should be approximately 2 to 4 times the number of CPU cores.
dispatch-task-number should not be too large to avoid overwhelming Workers.

Scale out Masters

DolphinScheduler supports multiple Masters, but scaling is not linear due to shared database bottlenecks and ZooKeeper coordination overhead.

3. More Workers is not always better

Adding more Workers blindly can overload the database and worsen queuing.

3.1 Worker configuration

worker:
  exec-threads: 50

Workers act as both execution units and resource isolation boundaries.

3.2 Estimation formula

Worker count ≈ Total concurrent tasks / Per-Worker concurrency
Per-Worker concurrency ≈ CPU cores × (2 to 4)

3.3 Example

For 1,000 concurrent tasks and 16-core Workers:

Per Worker ≈ 32 to 64 concurrent tasks
Required Workers ≈ 1000 / 50 ≈ 20

3.4 Task type matters more

Short tasks (<5 seconds):

Scheduling overhead exceeds execution time, making the Master the bottleneck.

Long tasks (>10 minutes):

Workers become resource bottlenecks due to long occupation time.

4. Different strategies for short and long tasks

4.1 Short tasks optimization

Typical scenarios include SQL queries and API calls.

Batching example:

-- Before: multiple small queries
SELECT * FROM table WHERE id = 1;
SELECT * FROM table WHERE id = 2;

-- After: batch query
SELECT * FROM table WHERE id IN (1,2,3,...);

Other strategies include reducing DAG granularity and moving loops into scripts.

4.2 Long tasks optimization

Typical scenarios include Spark or Flink jobs.

The bottleneck lies in resource systems rather than the scheduler.

Strategies:

Bind workloads to YARN queues or Kubernetes namespaces and enforce concurrency limits.

5. The database bottleneck is the most underestimated

Around 80% of production performance issues ultimately relate to the database.

5.1 Common problems

Slow queries
Row-level lock contention
Connection pool exhaustion

5.2 Typical SQL

UPDATE t_ds_task_instance
SET state = 'RUNNING'
WHERE id = ?;

Frequent updates to the same rows lead to lock contention and reduced throughput.

5.3 Optimization strategies

Read-write separation

Masters handle writes, while APIs and queries use read replicas.

Reduce update frequency

Inefficient pattern:

RUNNING → RUNNING → RUNNING

Optimization:

Reduce heartbeat frequency.

Batch updates

// Batch update task states
updateBatch(taskInstances);

6. ZooKeeper as a hidden bottleneck

ZooKeeper is responsible for coordination, including Master election, Worker registration, and heartbeat management.

6.1 Common symptoms

Scheduling jitter under high load
Workers falsely marked as dead
Frequent Master failovers

6.2 Root causes

Improper session timeout settings
Too many nodes and connections
Network instability

6.3 Optimization

Example configuration:

tickTime=2000
initLimit=10
syncLimit=5

Recommendations:

Increase session timeout to at least 20 seconds to tolerate transient failures.
Deploy ZooKeeper independently to avoid resource contention.

7. A real-world optimization case

Background

Daily tasks: 200,000
DAGs: 30,000
Masters: 2
Workers: 30

Issues

Scheduling latency exceeded 1 minute during peak hours
Database CPU usage reached 90 percent

Optimization process

Step 1: Database indexing
Result: latency reduced by 40 percent

Step 2: Reduce short tasks
Result: DAG count reduced by 30 percent

Step 3: Adjust Master parameters

exec-threads: 50 → 120

Result: throughput doubled

Final results

Scheduling latency reduced from 60 seconds to 8 seconds
Database CPU usage reduced from 90 percent to 50 percent
Overall throughput improved by 2 to 3 times

8. Summary: the essence of scheduling performance optimization

The core insight is that performance is a balance of:

Scheduling capacity × Execution capacity × Storage capacity × Coordination capability

Optimization must be holistic:

The Master controls the scheduling rhythm
Workers provide execution capacity
The database defines system limits
ZooKeeper ensures coordination stability

Ultimately:

The limit of a scheduling system is not how many tasks it can dispatch, but how long the database can sustain the load.

Part 1 | Scheduling Systems Are More Than Just “Timers”
Part 2 | The Core Abstraction Model of Apache DolphinScheduler
Part 3 | How Scheduling Actually Runs
Part 4 | The State Machine: The Real Soul of Scheduling Systems
Part 5 | What Happens When Tasks Fail? A Complete Guide to Retry and Backfill in Apache DolphinScheduler
Part 6 | Enterprise Multi-Tenancy and Resource Isolation Techniques in DolphinScheduler You Might Not Know
Next: The boundaries between DolphinScheduler and Flink, Spark, and SeaTunnel

Can Your Scheduler Fix Itself at 2 AM? Inside the DolphinScheduler Agent Meetup

Chen Debra — Thu, 02 Apr 2026 10:18:14 +0000

If you’ve ever worked with scheduling systems, you’ve probably had moments like this:

At 2 AM, your phone suddenly lights up.
Not a message—an alert. A job has failed.

You stare at the screen, with only one thought in your head:
“Can it just fix itself?”

It sounds a bit idealistic.
But this time, we actually want to take it seriously.

Soon, the Apache DolphinScheduler community will host a new online Meetup.

This time, we won’t dive into grand architectures or complex theories.
Instead, we start with a very “engineer-like” question:

👉 Can a scheduling system require less human effort?

📅 Event Info

Time: April 21, 2026, 14:00–15:00
Format: Online livestream
Register your seat now: https://meeting.tencent.com/dm/sdXKjKfLewVe

🎤 Who’s Speaking?

This session features Liu Xiaodong,
an algorithm engineer from Shanghai FamilyMart Co., Ltd.

His self-introduction is quite fun:

Not limited to one direction—he tinkers with everything.
Writes code, builds systems, explores new ideas.
And occasionally “wanders around Hyrule to discover new landscapes.”

Sounds like this won’t be a conventional talk.

💡 What’s the Topic?

The topic is simple yet vivid:
“DolphinScheduler Agent: I Just Want to Lie Down and Still Get Work Done”

It starts from a very real idea:

The dream state of a “lazy engineer” is:
When something breaks, the system detects and fixes it automatically.
Humans just take a glance and say a word—everything else is handled.

Sounds exaggerated?

This talk will explore:
👉 How far can we actually go in this direction

🧠 What Will You Learn?

This is not a purely conceptual talk, but an ongoing exploration:

The design of DolphinScheduler Agent
How to make scheduling systems more “self-healing”
Real-world attempts and lessons learned
A working demo

Rather than giving standard answers, it’s more like:
a journey recap + new ways of thinking

🎁 Bonus

There will also be a lucky draw during the livestream 🎉

You might even win a custom Apache DolphinScheduler keychain—
a must-have for community members!

👀 Who Should Join?

This Meetup is for you if:

You’re using or exploring DolphinScheduler
You’re interested in automation, agents, or intelligent operations
You want to see real demos, not just slides
Or you simply want to “work less” in a smarter way

📢 Final Thought

We’re used to fixing problems when they occur.
But rarely do we ask:
Can systems prevent problems—or even solve them on their own?

Maybe that’s the next step for scheduling systems.

📅 April 21
Let’s talk about building systems that are a little less exhausting.

Apache DolphinScheduler Local Setup Made Simple: A Beginner-Friendly Guide

Chen Debra — Thu, 02 Apr 2026 10:08:09 +0000

This article is intended for developers who want to read and debug the core source code of Apache DolphinScheduler locally. The example environment is based on Windows + IntelliJ IDEA + Docker Desktop + PostgreSQL + ZooKeeper.

If you only want to quickly تجربه features rather than debug the full chain of master / worker / api, it is recommended to use StandaloneServer first. If you want to debug the distributed scheduling workflow, follow this guide to start services separately.

Use Cases

Start MasterServer, WorkerServer, and ApiApplicationServer individually in IntelliJ IDEA
Use Docker Desktop to host PostgreSQL and ZooKeeper
Debug Java services locally on the host machine
Run the frontend locally and connect it to backend APIs

Environment Requirements

Docker Desktop
JDK 8 or 11
Maven 3.8+ (or use the built-in mvnw.cmd)
Node.js 16+
pnpm 8+
IntelliJ IDEA

The java.version in the root pom.xml is 1.8. It is recommended to use JDK 8 or 11 for local debugging.

1. Start PostgreSQL and ZooKeeper

First, navigate to the deploy/docker directory:

cd <your-path>\dolphinscheduler\deploy\docker

If you are using the docker-compose-windows.yml provided in the appendix, ensure that dolphinscheduler-zookeeper exposes port 2181.

master, worker, and api all connect to localhost:2181 by default. If ZooKeeper runs only inside the container without port mapping, Java processes started in IDEA will fail to connect.

Ensure the following configuration exists:

dolphinscheduler-zookeeper:
  image: zookeeper:3.8
  ports:
    - "2181:2181"

Start services:

docker-compose -f docker-compose-windows.yml up -d dolphinscheduler-postgresql dolphinscheduler-zookeeper

Optional verification:

docker ps
Test-NetConnection 127.0.0.1 -Port 5432
Test-NetConnection localhost -Port 2181

Expected results:

Port 5432 is reachable
Port 2181 is reachable

If you are using local or remote installations instead of Docker, skip this step but ensure configurations match your environment.

2. Build the Project

cd <your-path>\dolphinscheduler
.\mvnw.cmd spotless:apply
.\mvnw.cmd clean install -DskipTests

Notes:

spotless:apply formats code to avoid check failures
The first build may take a while

3. Initialize PostgreSQL Metadata Database

Before starting master and api, initialize metadata tables.

SQL script location:

dolphinscheduler-dao/src/main/resources/sql/dolphinscheduler_postgresql.sql

Using Docker PostgreSQL:

Get-Content -Path .\dolphinscheduler-dao\src\main\resources\sql\dolphinscheduler_postgresql.sql -Raw |
docker exec -i -e PGPASSWORD=root docker-dolphinscheduler-postgresql-1 psql -U root -d dolphinscheduler

Alternatively, use DataGrip, DBeaver, or psql.

Note: This script contains DROP TABLE IF EXISTS. Do NOT run it on production databases.

Verification:

select version from t_ds_version;

Expected: one record returned (e.g., 3.4.0)

4. Verify Local Configuration

Default configs (usually no changes needed):

PostgreSQL: 127.0.0.1:5432
DB: dolphinscheduler
Username: root
Password: root
ZooKeeper: localhost:2181

Config files:

dolphinscheduler-master/.../application.yaml
dolphinscheduler-api/.../application.yaml
dolphinscheduler-worker/.../application.yaml

If needed, modify:

spring.datasource.url
spring.datasource.username
spring.datasource.password
registry.zookeeper.connect-string

Do NOT use:

-Dspring.profiles.active=mysql

Use instead:

-Dspring.profiles.active=postgresql

5. Configure IntelliJ IDEA Run Configurations

Common settings:

JDK: 8 or 11
Use the classpath of the module
Enable: Add dependencies with "provided" scope to classpath
Working directory: project root

This option is critical to avoid missing dependency issues.

Create these configurations:

MasterServer

Main class: org.apache.dolphinscheduler.server.master.MasterServer

Ports:

RPC: 5678
Spring Boot: 5679

WorkerServer

Main class: org.apache.dolphinscheduler.server.worker.WorkerServer

Ports:

RPC: 1234
Spring Boot: 1235

ApiApplicationServer

Main class: org.apache.dolphinscheduler.api.ApiApplicationServer

Ports:

HTTP: 12345
Gateway: 25333

Startup order:

MasterServer
WorkerServer
ApiApplicationServer

6. Start Frontend

cd <your-path>\dolphinscheduler\dolphinscheduler-ui
pnpm install
pnpm run dev

Access:

http://localhost:5173

Default credentials:

Username: admin
Password: dolphinscheduler123

7. Verification

API

/actuator/health → should return UP
/swagger-ui → should load successfully

Frontend

Access UI and log in successfully

Logs

Check for fatal errors in the IDEA console

8. Common Issues

ZooKeeper connection failed

ZooKeeper is not running
Port 2181 not exposed

Missing `t_ds_version` table

DB not initialized
Wrong database

Missing dependencies in IDEA

Check the “provided scope” option

Port 12345 occupied

Stop conflicting processes

Built by the Community: Apache DolphinScheduler March 2026 Highlights

Chen Debra — Thu, 02 Apr 2026 09:59:10 +0000

Hey there! The March 2026 monthly report is here! The Apache DolphinScheduler community has been on fire 🔥

A total of 13 contributors actively submitted code. Version 3.4.1 was released, bringing enhanced scheduling, upgraded task plugins, improved API & UI, and fixing 15+ bugs.

Meanwhile, infrastructure has also been upgraded. Both enterprise and individual users are encouraged to upgrade and explore the latest features. Let’s grow with the community 🚀

Reporting period: March 1, 2026 – March 30, 2026

1. Release

Version	Release Date	Notes
3.4.1	2026-03-01	Latest stable release

📎 Download: https://dolphinscheduler.apache.org/download

2. Key Feature Updates

2.1 Scheduling Enhancements

Feature	Description	PR
Configurable Max Runtime	Set maximum runtime limits for workflows/tasks	#17932
Worker Group Optimization	Allow creation of Worker Groups without Workers	#17927
Scheduling Timeout Detection	Handle cases with missing or unavailable Workers	#17796

2.2 Task Plugin Improvements

Task Type	Improvement	PR
Java Task	Support built-in & custom variables	#17860
Zeppelin Task	Support parameter parsing	#17862
Procedure Task	Support cancellation & output parameters	#17696, #17973
HTTP Task	Fix nested JSON sending issue	#17911

2.3 API & UI Improvements

Module	Improvement	PR
API	Remove import/export (DSIP-104)	#17941
UI	Improve Spark parameter validation	#17958
UI	Fix Keycloak icon 404 issue	#18007
UI	Fix lock not released on request failure	#17989

3. Bug Fixes

Module	Issue	PR
Master	Fix timeout alert failure	#17818
Master	Fix workflow failure strategy issue	#17851
Master	Fix task not marked failed on init error	#17821
Dependent	Fix PostgreSQL dependency SQL error	#17837
API	Fix token deletion issue for non-admin users	#17997
API	Add tenant validation	#17970
DAO	Fix type mismatch in workflow_definition_code	#17988
Alert	Fix timeout unit inconsistency	#17920
SeaTunnel	Fix broken documentation link	#17905
Params	Fix Procedure Task param passing issue	#17968

4. Community Updates

Top Contributors

In March, 31 PRs were merged. Thanks to all 9 contributors 🙌

Full list: https://github.com/apache/dolphinscheduler/graphs/contributors

Infrastructure Updates

Upgrade ZooKeeper to 3.8.3
Upgrade Testcontainers to 1.21.4
Update license year
Add AI usage confirmation to PR template

5. Enterprise Recommendations

🔧 Upgrade Advice

Production environments are recommended to upgrade to 3.4.1
Includes multiple bug fixes and stability improvements

📋 Key Features to Watch

Runtime limits for workflows/tasks
Flexible Worker Group management
Enhanced Procedure Task capabilities

⚠️ Notes

No major API changes this month
Follow official docs for latest configurations

6. Statistics

Metric	March Data
Releases	1 (3.4.1)
Improvements	10+
Bug Fixes	15+
Contributors	13+

Meet ASF’s New Member Xiang Zihao: How He Impacts the Community with Code and the Apache Way

Chen Debra — Fri, 27 Mar 2026 03:24:08 +0000

Congratulations to @xiang Zihao on being recently invited to become an ASF Member! As a PMC Member of Apache DolphinScheduler, the community is truly delighted by this well-deserved recognition.

Over the years, his continuous contributions to the community have been evident to all—from documentation improvements to code enhancements, from active discussions to helping newcomers. His presence can be seen everywhere. Beyond Apache DolphinScheduler, he is also deeply involved in multiple ASF open source projects, consistently practicing the Apache Way year after year. All his persistent efforts have finally led him to this milestone.

On this occasion, the community conducted another in-depth interview with him. This time, through five chapters—Personal Background, Open Source Contributions & Growth, Becoming an ASF Member, DolphinScheduler Community Development, and Open Source Culture—we take a closer look at his journey, his growth story in open source, and the passion and persistence he has accumulated within the community.

Part 1: Personal Background

Q1: Could you briefly introduce yourself, including how you entered the big data and open source fields?

A: I’m Xiang Zihao / SbloodyS 👋
My hobbies include coding during the day, gaming at night, taking my kid out on weekends, backpacking during holidays, and enjoying tea chats when I need a break.
My life philosophy is: explore the world through code, and heal through life.

Q2: When did you start contributing to Apache DolphinScheduler? What was the trigger?

A: I first encountered Apache DolphinScheduler in 2021. It was actually quite accidental—an opportunity at work introduced me to this scheduling system. Unexpectedly, this “chance encounter” gradually drew me in, and I began contributing to the community.

Q3: What key work or features have you contributed to DolphinScheduler?

A: I have mainly worked on documentation optimization, performance improvements, bug fixes, code reviews, and CI/CD optimization.

Part 2: Open Source Contributions & Growth

Q4: In open source collaboration, what do you think is the most important ability? Technical skills, communication, or something else?

A: I believe the most important ability in open source collaboration is not a single dimension, but a combination of technical skills, communication ability, and an open mindset.
Technical skills are the foundation, communication determines efficiency and quality, and an open mindset is the key to long-term growth.
If I had to prioritize, I’d say openness is the most fundamental—it determines whether you are willing to learn, ask, and evolve.

Q5: What advice would you give to newcomers in open source?

A: Start by “using” rather than “building.”
Become a real user first, identify problems during usage, submit issues, then gradually move to documentation fixes, bug fixes, and eventually core feature development.
Don’t aim to contribute “big features” right away—every small PR is the beginning of building trust with the community.

Part 3: Becoming an ASF Member

Q6: Congratulations on becoming an ASF Member! What was your first reaction?

A: Thank you! Honestly, my first reaction was a mix of surprise and gratitude.

Surprise—because becoming an ASF Member was never my initial goal. In 2021, I simply started contributing to solve problems and give back to the community, and I never imagined this journey would lead here.

Gratitude—because this honor represents the trust and support of the entire community. Without patient reviewers and fellow contributors, I wouldn’t be here today.

For me, becoming an ASF Member is not an endpoint, but a new beginning. It means greater responsibility and a commitment to give back even more.

Q7: How closely related is this achievement to DolphinScheduler? What other factors contributed?

A: DolphinScheduler was an important foundation, but not the only reason.

On one hand, it’s the first Apache project I deeply engaged in, where I built experience and credibility through contributions.

On the other hand, ASF evaluates broader impact:

Cross-project contributions
Community-building efforts
Practicing the Apache Way

In short, DolphinScheduler was my starting point, but sustained and sincere contributions to the broader Apache ecosystem made this possible.

Q8: What does becoming an ASF Member mean to you and the community?

A: For me, it’s recognition from the global open source community—not for one achievement, but for long-term commitment. It’s also a responsibility to keep improving.

For the community, ASF Members are core contributors responsible for project incubation, governance, and cultural inheritance.

For China’s open source ecosystem, more ASF Members represent growing global recognition and diversity.

Q9: How important is the Apache Way to project success?

A: It can be summed up in one phrase: “Community Over Code.”
Code can be replaced, but a healthy, collaborative community cannot.
The Apache Way ensures openness, transparency, and consensus-driven development—proven principles behind many successful projects.

Part 4: DolphinScheduler Community Development

Q10: What are the key milestones in DolphinScheduler’s growth?

A: Three major turning points:

Donation to Apache
Graduation from incubation
Globalization

These milestones transformed it into a globally recognized project.

Q11: How do you see its positioning and future?

A: DolphinScheduler is evolving into a next-generation cloud-native workflow orchestration platform, connecting the full data lifecycle.
Its future lies in integrating with modern data stacks and becoming essential for data engineers worldwide.

Q12: What are your future plans as an ASF Member?

A: Three directions: Deepening, Expanding, and Passing On.

Deepening: continue contributing to core tech and governance
Expanding: engage in more Apache projects
Passing On: help more developers enter open source

Open source has given me a lot—I want to pass it forward.

Part 5: Open Source Culture & Personal Growth

Q13: How has open source changed you?

A: It reshaped my definition of growth.
Before, growth meant improving skills. Now, it means expanding impact—helping others grow.
I’ve transformed from a solo problem-solver into a global collaborator.

Q14: How would you summarize the spirit of open source in one sentence?

A: Open source is a belief that sharing is more powerful than owning.

That concludes our interview! If you found this inspiring, feel free to like, share, and spread the word so more people can discover valuable insights from the open source world 🏅

Part 6 | Enterprise Multi-Tenancy and Resource Isolation Techniques in DolphinScheduler You Might Not Know

Chen Debra — Fri, 27 Mar 2026 03:22:57 +0000

In Apache DolphinScheduler, multi-tenancy is not just an “auxiliary permission feature,” but the core execution model of the scheduling system. What it truly solves is not “who can use the system,” but:

Under what identity tasks run, what resources they consume, and how to prevent interference between them

Only by understanding this can we grasp the essence of DolphinScheduler’s multi-tenant design.

What Are Single-Tenant and Multi-Tenant?

First, let’s clarify what single-tenant and multi-tenant mean.

In enterprise scheduling platforms, how different teams or business units share platform resources is a fundamental design concern. Single-tenancy and multi-tenancy are two common models, with clear differences in resource isolation, stability, and scalability. Understanding these differences helps organizations choose the right architecture for efficient and controllable scheduling.

A single-tenant system serves only one team or business unit. All tasks share the same execution environment, resource pool, and permission system.

A multi-tenant system, on the other hand, allows multiple teams to share one platform. Each team is logically isolated as an independent Tenant and mapped to underlying execution identities (Linux users), resource queues (YARN queues), or cloud-native namespaces (Kubernetes namespaces), enabling independent management of tasks and resources.

Compared with single-tenancy, multi-tenancy provides significant advantages in resource isolation, stability, and scalability. While single-tenancy is simple to deploy and manage, resource contention and task interference become inevitable as the number of users grows. Multi-tenancy avoids this by clearly isolating Tenants and assigning dedicated resource pools per team or environment.

Core Mechanism: Tenant-Centric Execution Model

To overcome the limitations of single-tenancy, Apache DolphinScheduler adopts a multi-tenant design.

At the heart of this design is a single concept: Tenant.

However, a Tenant is not just a logical label—it is an execution context container. When a task is scheduled, the system determines three key aspects based on the Tenant:

1. Execution Identity

Tasks do not run abstractly on Worker nodes; they must run as a specific OS user. A Tenant is bound to a Linux user, and tasks execute under that identity, inheriting file permissions and system-level isolation.

Example: Executing tasks as a Linux user

# Switch to the Linux user corresponding to the Tenant
sudo su - team_alpha_user

# Execute workflow task
spark-submit --class com.example.Job /opt/jobs/job.jar

Description: Tenant is bound to an OS user, and tasks run under this identity on Worker nodes, achieving file permission and environment isolation.
Tip: Ensure each Tenant has an independent home directory to avoid unauthorized access. ### 2. Resource Ownership

When tasks are submitted to engines like Spark or Flink, they must enter a resource pool. The Tenant determines the target resource queue or namespace, ensuring controlled resource usage.

Example: Create a Tenant and bind a YARN Queue

curl -X POST http://dolphinscheduler-api:12345/tenants \
  -H "Content-Type: application/json" \
  -d '{
        "name": "team_alpha",
        "queue": "team_alpha_queue",
        "description": "Team Alpha Tenant"
      }'

Description: Each Tenant corresponds to a YARN Queue or K8s Namespace, ensuring exclusive resource usage.
Tip: After creating a Tenant, remember to configure the queue or namespace in the resource scheduling system. ### 3. Isolation Boundary

Tenant defines a clear boundary for data access, task execution, and resource usage, forming logical isolation between teams.

Together, these three aspects form the foundation of DolphinScheduler’s multi-tenant mechanism.

How Resource Isolation Is Achieved

Multi-tenancy alone at the scheduling layer is not enough. The key design of DolphinScheduler is mapping Tenants to real underlying resource systems.

YARN-Based Isolation

In traditional big data architectures, Tenants are mapped to YARN queues. Each Tenant corresponds to a queue with defined capacity and limits. Tasks are submitted with queue information and scheduled accordingly, preventing resource contention.

YARN Mapping Example:

Queue configuration

<queue name="team_alpha_queue">
  <capacity>30</capacity>
  <maximum-capacity>50</maximum-capacity>
  <user-limit-factor>1.0</user-limit-factor>
</queue>

Description: Tasks automatically enter the queue when submitted, avoiding resource conflicts between Tenants.
Tip: Capacity and maximum capacity can be dynamically adjusted based on team workload.

Even if one team submits a large number of tasks, it only consumes resources within its own queue.

Kubernetes-Based Isolation

In cloud-native environments, Tenants are mapped to Kubernetes namespaces. Tasks run as Pods, and:

ResourceQuota limits total resource usage
LimitRange restricts per-task resource consumption

apiVersion: v1
kind: Namespace
metadata:
  name: team-alpha
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    cpu: "20"
    memory: "64Gi"
    pods: "50"

Description: Limits total resources and number of Pods to achieve cloud-native isolation.
Tip: Combine with LimitRange to control per-task resource limits and prevent a single task from monopolizing resources.

This approach isolates not only resources but also runtime environments and networking.

OS-Level Isolation

At the execution layer, Linux users provide the final isolation boundary. Even on the same machine, tasks from different Tenants cannot access each other’s files or scripts.

End-to-End Execution Flow

Putting everything together, the execution flow looks like this:

A workflow is triggered in DolphinScheduler
The system determines the Tenant
The Master assigns tasks to Workers
Workers switch to the corresponding Linux user
Tasks are submitted with resource metadata (YARN queue / K8s namespace)
Tasks run within the assigned resource pool under defined limits

This creates full isolation from scheduling logic to resource execution.

Technical Architecture

The architecture can be understood in three layers:

Top Layer: DolphinScheduler (Tenant / Workflow)
Middle Layer: Mapping (Linux User / YARN Queue / K8s Namespace)
Bottom Layer: Resource systems (Compute nodes / Big data clusters / Kubernetes clusters)

The key idea is:

The scheduling layer does not directly manage resources—it controls them through Tenant mapping

Why This Design Works in Enterprises

This design becomes especially powerful in enterprise environments.

When multiple teams share a platform, resource contention is inevitable. Without Tenant-to-resource mapping, a high-load workload could impact the entire system. With proper isolation, each team operates within its own boundaries.

It also simplifies troubleshooting. Issues can be traced to a specific Tenant and then to its corresponding resource pool, without affecting the entire system.

Most importantly, the design is highly scalable. Adding new teams or integrating new compute engines only requires extending Tenant mappings, without redesigning the scheduling system.

Summary

DolphinScheduler’s multi-tenant design is essentially a way to embed the scheduling system into the resource ecosystem. Instead of relying on complex logic, it leverages operating systems, resource schedulers, and container platforms to build a stable, clear, and controllable execution model.

For engineers, the real focus is not:

“How to create a Tenant”

but rather:

“How to map Tenants to resources effectively to achieve true isolation and stability”

That is the core value of multi-tenant design.

Previous articles: Part 1 | Scheduling Systems Are More Than Just “Timers”

Part 2 | The Core Abstraction Model of Apache DolphinScheduler

Part 3 | How Scheduling Actually Runs

Part 4 | The State Machine: The Real Soul of Scheduling Systems

Part 5 | What Happens When Tasks Fail? A Complete Guide to Retry and Backfill in Apache DolphinScheduler

Next article preview: Part 7 | Where Are the Performance Bottlenecks in Scheduling Platforms?

Apache SeaTunnel 2.3.13 Major Release! Top 10 Features You Should Know

Chen Debra — Fri, 20 Mar 2026 09:35:27 +0000

Apache SeaTunnel community officially released version 2.3.13! This release is a milestone for Apache SeaTunnel, bringing important features such as Checkpoint API, Flink engine upgrade, large file parallel processing, multi-table sync, AI Embedding Transform, and richer connector extensions. Whether for batch processing or real-time CDC syncing to Lakehouse, SeaTunnel can now support your data integration tasks more efficiently, stably, and intelligently.

Thanks to 50+ community contributors, this release includes 100+ PRs of new features, optimizations, and bug fixes. If you are building data warehouses, real-time sync platforms, or AI data pipelines, this release is worth your attention.

No time to read the full Release Notes? No worries, here are the Top 10 features of this release with PR references for your reference.

Full Release Note: https://github.com/apache/seatunnel/releases/tag/2.3.13

01 New Checkpoint API Enhances Task Fault Tolerance

In data sync tasks, checkpoints are one of the core mechanisms to ensure task reliability. SeaTunnel 2.3.13 introduces Checkpoint API (#10065), making task state management more flexible and providing a solid foundation for future scheduling and operation capabilities. The Zeta engine supports min-pause configuration (#9804) to avoid system pressure caused by frequent checkpoints.

Monitoring has also been enhanced, such as adding Sink commit metrics and calculating commit rate (#10233), returning PendingJobs information in the task overview interface (#9902), and providing REST API to view the Pending queue (#10078).

These capabilities help users better understand task execution status and optimize checkpoint strategies.

02 Flink 1.20.1 Support and Enhanced CDC

On the engine side, this version improves Apache Flink support. SeaTunnel now supports Flink 1.20.1 (#9576), and CDC sync capabilities have been enhanced. CDC Source now supports Schema Evolution (#9867), automatically adapting sync tasks to source table structure changes.

Additionally, NO_CDC Source also supports checkpoints (#10094), improving task recovery. These changes make SeaTunnel more stable in scenarios with frequent database schema changes.

03 Large File Parallel Reading Significantly Improved

In real data platforms, large amounts of data often exist as files, such as HDFS, object storage, or local file systems.

This release significantly optimizes file processing performance. HDFS File Connector supports true large file parallel splitting (#10332), LocalFile Connector supports CSV, Text, JSON large file parallel reading (#10142), and Parquet files now support Logical Split (#10239).

HDFS File also supports multi-table reading (#9816). These improvements significantly increase throughput for TB-scale file processing.

04 File Connector Adds Update Sync Mode

Previously, file sync tasks only supported append or overwrite. In this version, multiple file connectors add sync_mode=update, including FTP, SFTP, and LocalFile Source (#10437), and HdfsFile Source (#10268). This allows file sync tasks to support update semantics, better fitting incremental data processing scenarios.

05 Connector Ecosystem Expansion

SeaTunnel 2.3.13 continues to expand and enhance the connector ecosystem. For analytical databases, it adds DuckDB Source and Sink support (#10285), suitable for local analysis and data exploration.

New or enhanced connectors include Apache HugeGraph Sink (#10002), AWS DSQL Sink (#9739), Lance Dataset Sink (#9894), IoTDB 2.x Source and Sink (#9872).

Existing connectors have also been improved: PostgreSQL supports TIMESTAMP_TZ (#10048), Hive Sink supports SchemaSaveMode and DataSaveMode (#9743), MongoDB Sink supports multi-table writing and adds SaveMode (#9958 / #9883).

These updates significantly improve SeaTunnel’s adaptability in database and Lakehouse scenarios and the efficiency of building data pipelines.

Category	Connector	Type	Feature Highlights	PR
Analytical DB	DuckDB	Source/Sink	Read and write data from DuckDB, suitable for local analysis and exploration	#10285
Graph DB	Apache HugeGraph	Sink	Write data into HugeGraph	#10002
SQL Lakehouse	AWS DSQL	Sink	Write data into AWS DSQL	#9739
File/Dataset	Lance Dataset	Sink	Write data into Lance Dataset	#9894
Time Series DB	IoTDB 2.x	Source/Sink	Add IoTDB 2.x source and sink support	#9872
Relational DB	PostgreSQL	Source	Support TIMESTAMP_TZ type	#10048
Data Warehouse	Hive	Sink	Support SchemaSaveMode and DataSaveMode	#9743
Document DB	MongoDB	Sink	Support multi-table write and new SaveMode	#9958 / #9883

06 Kafka Supports Protobuf Schema Registry

In real-time scenarios, Kafka often uses Schema Registry. This release adds Protobuf Schema Registry Wire Format support (#10183) to Kafka Connector, allowing SeaTunnel to directly parse Protobuf data managed via Schema Registry, making real-time pipeline construction easier.

07 New AI Embedding Transform

With AI and data engineering integration, more companies need vector data pipelines.

SeaTunnel adds Multimodal Embedding Transform (#9673) in the Transform component, generating vector data directly in pipelines for vector databases, RAG systems, and AI retrieval applications. RegexExtract Transform (#9829) further enhances data cleaning.

08 Markdown Parser Supports RAG Scenarios

Markdown documents are common in AI data preparation. This release adds Markdown Parser (#9760) and related documentation (#9834) for parsing and structuring Markdown, facilitating RAG pipeline construction.

09 Stability and Performance Improvements

This release includes numerous stability and performance optimizations, such as ClickHouse Connector parallel read strategy (#9801), MySQL Connector shard calculation (#9975), JSON parsing for nested structures (#10000), Zeta engine task metrics (#9833), and more.

It also fixes production issues like Zeta engine memory leak on task cancellation (#10315), ClickHouse ThreadLocal memory leak (#10264), MongoDB multi-task submit (#10116), HBase Source scan exception (#10287), Hive Sink init failure (#10331), etc.

10 Bug Fixes and Documentation Updates

Fixes include CDC Snapshot Split null pointer (#10404), ClickHouse memory leak (#10264), MongoDB multi-task submit (#10064, #10116), HBase scan exceptions (#10336, #10287), JDBC schema merge overflow (#10387, #9942, #10093), Hive Sink overwrite semantics (#10279, #9823, #9743), Elasticsearch Sink task exit issue (#10038), and other Connector, Transform, Engine, UI, CI fixes (#10422, #10013, etc.).

Documentation improvements include SeaTunnel MCP & x2SeaTunnel docs (#10108), connector config examples (#10283, #10250, #10241, #10202), multi-table sync examples (#10241), upgrade incompatibility notes (#10068), and doc structure optimizations (#10262, #10395, #10351, #10420, #10438, #10424, #10109, #10382, #10385), helping new users get started and developers better understand architecture and features.

Thanks to Contributors ❤️

Special thanks to release manager @xiaochen-zhou for strong support in planning and execution. Thanks to all volunteers; your efforts keep the SeaTunnel community growing!

Adam Wang, AzkabanWarden.Gf, Bo Schuster, cloud456, CloverDew, corgy-w, CosmosNi, Cyanty, David Zollo, dotfive-star, dy102, dyp12, Frui Guo, Jarvis, Jast, Jeremy, JeremyXin, Jia Fan, Joonseo Lee, krutoileshii, 老王, Leon Yoah, Li Dongxu, LiJie20190102, limin, LimJiaWenBrenda, liucongjy, loupipalien, mengxpgogogo-eng, misi, 巧克力黑, shfshihuafeng, silenceland, Sim Chou, Steven Zhao, wanmingshi, wtybxqm, yzeng1618, zhan7236, zhangdonghao, zhuxt2015, zy

Download & Try

Download: https://seatunnel.apache.org/download
Upgrade Guide: https://seatunnel.apache.org/docs/upgrade-guide

Upgrade Note: If you are on SeaTunnel 2.3.x, upgrading to 2.3.13 is generally safe as it focuses on feature enhancement and stability. Back up config files and test in staging. For tasks using checkpoints, stop tasks and confirm state consistency to avoid checkpoint conflicts. Check connector config changes (Hive, MongoDB, Kafka). If using Flink engine, consider upgrading to Flink 1.20.x for better compatibility and CDC support.

DEV Community: Chen Debra

Upgrading DolphinScheduler Across Major Versions: From 3.1.3 to 3.4.1 via API Automation

1. Background: Why Perform a Major Version Upgrade?

Existing Environment

Drivers Behind the Upgrade

Why We Did Not Use the Official Upgrade Path

2. Overall Migration Strategy: Bypassing the Official Upgrade Path with a “Rebuild + API” Approach

2.1 Core Idea

2.2 Comparison of Advantages and Risks

3. Detailed Implementation Steps

3.1 Environment Preparation

3.1.1 Deploying the New Environment

3.1.2 API Access Configuration

3.2 Metadata Database Initialization

3.3 Migration Script Development

3.3.1 Preliminary Preparation and Testing

3.3.2 Code Development — Reading Original Workflow Definitions

3.3.3 Creating Workflows via API

3.4 Migration Execution

3.4.1 Migration Procedure

3.4.2 Migration Execution Results

3.4.3 Runtime Status

4. Risk Control and Contingency Planning

4.1 Major Risks

4.2 Contingency Plans

4.2.1 Rollback Strategy

4.2.2 Data Backup

5. Conclusion

5.1 Project Outcomes

5.2 Lessons Learned

5.3 Future Plans

DolphinScheduler Agent Is Now Open-Source! Bringing Self-Healing Automation to DataOps

The Pain of Traditional Operations: Slow Recovery Isn’t About Commands — It’s About Fragmented Context

A Major Upgrade: From Fragmented Human Coordination to an Intelligent End-to-End Closed Loop

Five-Layer Core Architecture: Not Just Scripts, but a Safe and Controllable Intelligent Control Chain

1. L1 Event & Collaboration

2. L2 Session Integration

3. L3 Intelligent Orchestration

4. L4 Execution Control

5. L5 Governance & Reporting

Four Core Modules: Making Self-Healing Truly Production-Ready

📌 Channel: Native Feishu Entry Point for Unified Collaboration

📌 Runtime: Intelligent Orchestration Engine with Decoupled Rules and Execution

📌 Control Plane: dsctl as the Unified Execution Foundation

A Seven-Step Standard Closed Loop: Dual-Path Protection for Production Safety

📌 Safety: Four-Level Risk Governance — Safety Comes First

A Pragmatic Roadmap: Gradual Delegation Toward Autonomous Operations

Demo

🎉 Official Open Source Release: dsctl Is Now Available on GitHub

Final Thoughts

Part 9 | Beyond Scheduling: How Data Platforms Evolve into DataOps Systems

The Evolution of the Scheduler’s Role

Engineering Transformation Driven by Standards

How Scheduling Platforms Support Engineering Governance

DataOps Practices with Apache DolphinScheduler

The Evolution Path of Enterprise Data Platforms

A Governable Data Task in Practice

Conclusion

Previous articles:

How A Leading Manufacturing Enterprise in Shenzhen Deploys Apache DolphinScheduler Across Dozens of Factories Within One Day?

The Era of Intelligent Manufacturing

Challenges of Traditional Data Processing Approaches

The Apache DolphinScheduler Solution

Worker Node Grouping: A Solution for Complex Network Environments

Data Collection

Data Interaction

Template-Based Data Collection and Distribution Across Multiple Factories

A Qualitative Leap: From Manual Workshop to Industrial Pipeline

Results and Future Outlook

Conclusion

Part 8 | Boundaries, Collaboration, and Best Practices Between Apache DolphinScheduler and Flink & Spark

Responsibilities and Boundaries Between the Scheduler and Data Engines

Differences in Scheduling Between Batch, Streaming, and CDC

Why the Scheduling System Should Not Intrude into the Execution Engine

A Practical Architecture Example Integrating SeaTunnel

Summary: Let the System Return to “Each Doing Its Own Job”

Previous articles:

Part 7 | Where Scheduling Systems Really Break and the Hidden Bottlenecks Beyond CPU and Scale

1. From the overall architecture, where exactly are the bottlenecks?

2. The Master bottleneck is not CPU, but the “scheduling model”

Missing `t_ds_version` table