Chen Debra

Posted on May 21

Upgrading DolphinScheduler Across Major Versions: From 3.1.3 to 3.4.1 via API Automation

#apachedolphinscheduler #upgade #ai #programming

1. Background: Why Perform a Major Version Upgrade?

Existing Environment

Current DolphinScheduler Version: 3.1.3
Current SeaTunnel Version: 2.1.3

Deployment Scale:
1 Master + 2 Workers, with over 3,700 workflow definitions and more than 20,000 scheduled tasks executed daily.

Years in Production:
The system has been running stably for over three years.

Drivers Behind the Upgrade

Functional Requirements:
With growing business demands, the limitations of the current version became increasingly apparent, including architectural design constraints, metadata database processing bottlenecks, and insufficient server resources.
Community Support:
The official community recommends adopting the latest stable release to obtain better technical support.
Performance Optimization:
Version 3.4.1 delivers significant improvements in scheduling performance and system stability.

Why We Did Not Use the Official Upgrade Path

Large Version Gap:
The upgrade path would require multiple intermediate upgrades:
3.1.3 → 3.2.0 → 3.3.0 → 3.4.1.
Production Environment Constraints:
Multiple maintenance windows were unacceptable due to strict business continuity requirements.
Architectural Change Risks:
Risks included resource center refactoring, metadata schema changes, and compatibility issues.
Workload Considerations:
With thousands of workflows, manually rebuilding tasks would require enormous effort, making automation essential.

2. Overall Migration Strategy: Bypassing the Official Upgrade Path with a “Rebuild + API” Approach

2.1 Core Idea

Instead of pursuing an “in-place incremental upgrade,” we adopted a “new environment + data migration” strategy:

The old 3.1.3 cluster continued running without impacting production workloads.
A brand-new 3.4.1 cluster was deployed to ensure a clean architecture.
Custom scripts were developed to retrieve workflow definitions and task configurations from the old-version APIs.
New-version APIs were used to batch-create workflows in the new cluster.
Business scheduling traffic was gradually switched to the new environment.

2.2 Comparison of Advantages and Risks

Dimension	Official Upgrade Approach	Rebuild + API Approach
Downtime	Multiple upgrades with cumulative downtime potentially lasting hours or even days	Nearly seamless cutover by stopping old schedules and enabling new ones
Rollback Difficulty	Difficult, as the database schema has already changed	Easy, since the old environment remains intact
Data Consistency	Requires validation of all schema migrations	Only core business data (workflow definitions) is migrated; historical execution records are excluded
Version Compatibility	Must handle compatibility issues across all intermediate versions	Directly adapts to 3.4.1 with only necessary parameter transformations
Workload	Requires repeated validation cycles	Effort mainly concentrated on script development
Suitable Scenarios	Minor version upgrades	Major version jumps and large-scale task migration

3. Detailed Implementation Steps

3.1 Environment Preparation

3.1.1 Deploying the New Environment

Deploy a brand-new DolphinScheduler 3.4.1 cluster.
Configure dependencies such as the database and Registry. ZooKeeper was deprecated and replaced with JDBC Registry.
Configure required components such as DataX and SeaTunnel.
Understand key changes in the new version, including:
- SeaTunnel 2.1.3 integration startup mode changed from Spark engine execution to start-seatunnel-spark.sh;
- Default configurations such as tenants, worker groups, and environments are now managed through project preferences;
- Parameter passing behavior changed: downstream tasks must explicitly define IN parameters to receive upstream variable values.
Verify the basic functionality of the new environment and manually validate representative workflows.

3.1.2 API Access Configuration

Configure API access permissions in the new environment by creating new tokens in token management.
Obtain administrator tokens for API calls.
Verify API connectivity.

3.2 Metadata Database Initialization

Replicating Base Tables: Rebuild foundational metadata tables in the new metadata database according to the old-version configuration, including preserving IDs wherever possible. This significantly reduced modification complexity during script-based workflow restoration.

Verification Item	Table
Tenant Table	`t_ds_tenant`
Project Table	`t_ds_project`
User Table	`t_ds_user`
Environment Tables	`t_ds_environment`, `t_ds_environment_worker_group_relation`
Worker Group Table	`t_ds_worker_group`
Data Source Table	`t_ds_datasource`

3.3 Migration Script Development

3.3.1 Preliminary Preparation and Testing

Categorize workflows into template-based and non-template-based tasks.
Select representative workflows and execute them in the new environment to verify successful execution and data synchronization accuracy.

3.3.2 Code Development — Reading Original Workflow Definitions

...
 // Retrieve all workflows and process them iteratively
        String processDefinitionUrl = OLD_URL + "/dolphinscheduler/projects/" + oldProjectCode +
                "/process-definition/query-process-definition-list";
        Map<String, String> map = new HashMap<>();
        map.put("projectCode", oldProjectCode);
        String pdRes = httpClientUtilOld.doGetRequest(processDefinitionUrl, map);
        ArrayList<JSONObject> dataList = parseResDataToList(pdRes);
        for (JSONObject job : dataList) {
            String oldWFCode = job.get("code").toString();
            Map<String, String> mapPara = new HashMap<>();
            String oldurl = OLD_URL + "/dolphinscheduler/projects/" + oldProjectCode
                    + "/process-definition/" + oldWFCode;
            mapPara.put("code", oldWFCode);
            mapPara.put("projectCode", oldProjectCode);

            String res = httpClientUtilOld.doGetRequest(oldurl, mapPara);
            JSONObject jsonObject = JSON.parseObject(res);
            JSONObject data = (JSONObject) jsonObject.get("data");
            JSONObject processDefinition = data.getJSONObject("processDefinition");
            JSONArray processTaskRelationList = data.getJSONArray("processTaskRelationList");
            JSONArray taskDefinitionList = data.getJSONArray("taskDefinitionList");

            // TODO: generate new task codes and replace old ones
            // Populate workflow information and create workflows
            createWF(processDefinition, processTaskRelationList, taskDefinitionList, NEW_IP, newProjectCode);

3.3.3 Creating Workflows via API

 // Generate task codes based on task count
        int taskCnt = taskDefinitionList.size();
        List<String> taskCodeList = taskDefinitionList.stream()
                .map(obj -> (JSONObject) obj)
                .map(obj -> obj.getString("code"))
                .collect(Collectors.toList());

        try {
// TODO: generate task codes
            String taskCodeUrl = NEW_URL + "/dolphinscheduler/projects/" + newProjectCode + "/task-definition/gen-task-codes";
            HashMap<String, String> taskCodeMap = new HashMap<>();
            // Generate n task codes
            taskCodeMap.put("genNum", String.valueOf(taskCnt));
            String codeData = httpClientUtilNew.doGetRequest(taskCodeUrl, taskCodeMap);
            Object codes = JSON.parseObject(codeData).get("data");
            JSONArray taskCodeArr = JSON.parseArray(codes.toString());

// Add downstream input parameters based on actual task requirements
            for (int i = 0; i < taskDefinitionList.size(); i++) {
                JSONObject logTask = (JSONObject) taskDefinitionList.get(i);
                if (“Condition Logic”)) {
                    JSONObject taskParams = logTask.getJSONObject("taskParams");
                    JSONArray localParams = taskParams.getJSONArray("localParams");

                    JSONObject hiveParam = new JSONObject();
                    hiveParam.put("prop", "hiveAmount");
                    hiveParam.put("direct", "IN");
                    hiveParam.put("type", "VARCHAR");
                    hiveParam.put("value", "");
                    localParams.add(hiveParam);

                    logTask.put("taskParamList", localParams);

                    JSONObject paramMap = new JSONObject();
                    for (Object obj : localParams) {
                        JSONObject param = (JSONObject) obj;
                        paramMap.put(param.getString("prop"), param.getString("value"));
                    }

                    logTask.put("taskParamMap", paramMap);
                    ....


// Replace required parameters such as task code and SeaTunnel execution engine
 for (int i = 0; i < taskCodeList.size(); i++) {
                String oldCode = taskCodeList.get(i);
                String newCode = taskCodeArr.getString(i);

                // Replace task code
                // Replace SeaTunnel engine: "SPARK" -> "start-seatunnel-spark.sh"
                taskDefinitionListJsonStr = taskDefinitionListJsonStr
                        .replace("\"code\":" + oldCode + ",", "\"code\":" + newCode + ",")
                        .replace("\"engine\":\"SPARK\",", "\"startupScript\":\"start-seatunnel-spark.sh\",");

                taskRelationListJsonStr = taskRelationListJsonStr
                        .replace("TaskCode\":" + oldCode + ",", "TaskCode\":" + newCode + ",");

                locationsJsonStr = locationsJsonStr
                        .replace(oldCode, newCode);

...
  }
            Map<String, String> map = new HashMap<>();
            map.put("taskDefinitionJson", taskDefinitionListJsonStr);
            map.put("taskRelationJson", taskRelationListJsonStr);
            map.put("locations", locationsJsonStr);

            map.put("name", processDefinition.getString("name"));
            map.put("tenantCode", "omm");
            map.put("executionType", processDefinition.getString("executionType"));
            map.put("description",
                    processDefinition.getString("description") == null ? "" : processDefinition.getString("description"));
            map.put("globalParams", processDefinition.getString("globalParams"));
            map.put("timeout", processDefinition.getString("timeout"));

            String processDefinitionUrl = NEW_URL + "/dolphinscheduler/projects/" + newProjectCode + "/workflow-definition";
            String processDefinitionRes = httpClientUtilNew.doPostRequest(processDefinitionUrl, map);

3.4 Migration Execution

3.4.1 Migration Procedure

Back Up Old Scheduling Tables:
Example: t_ds_schedules_20260416_10
Select a Pilot Project:
Choose a project with moderate workload and limited business impact.
Migrate Workflow Definitions:
Migrate approximately 200 workflow definitions, including scheduling configurations.
Deploy Workflows Without Enabling Schedules:
Deploy workflows first without activating schedules.
Manual Validation:
Execute workflows manually in batches and verify conflicts with the original cluster. Since most workflows run daily, hourly, or every 15 minutes, conflicts were minimal.
Investigate Failed Tasks:
Analyze root causes, fix issues, and rerun failed workflows.
Enable Scheduling Configurations:
Enable schedules after all workflows pass validation.
Disable Old Cluster Scheduling:
After confirming stable operation in the new environment, disable corresponding schedules in the old cluster.
Migrate Project by Project, Batch by Batch

3.4.2 Migration Execution Results

Following this process, we first selected a representative project containing 199 workflows. After migration, it was tested in production for one week without issues.

Subsequently, we completed migration for 50 projects, totaling approximately 3,700 workflows, within about 10 days.

Tracking Table

No.	Project Name	Project Code	Workflow Count	Progress	Remarks
1	Project 1	13*******	199	Completed on 04/16	One-week stability test completed
2	Project 2	...	...	...	...
...	...	...	...	...	...
50	...	...	...	...	...

3.4.3 Runtime Status

The new cluster has now been running for nearly one month.

Previously, scheduling delays ranged from 10 seconds to over one minute in severe cases. After the upgrade, scheduling latency has been virtually eliminated.

Issues related to missing scheduling instances have also not reoccurred.

So far, the system has been running smoothly without any identified problems, and continuous monitoring remains in place.

4. Risk Control and Contingency Planning

4.1 Major Risks

Risk of Data Loss:
Some configurations could potentially be missed during migration.

Compatibility Issues:
Certain configurations supported in the old version may not be supported in the new version.

Business Interruption Risks:
Scheduling delays could occur during the switchover process.

4.2 Contingency Plans

4.2.1 Rollback Strategy

Immediately stop scheduling in the new environment.
Restore scheduling services in the old environment.
Analyze root causes and retry after issue resolution.

4.2.2 Data Backup

Perform a complete backup of the old environment database.
Back up the initial configuration of the new environment.

5. Conclusion

5.1 Project Outcomes

Successfully completed a cross-version upgrade with zero business interruption.
Automated migration scripts significantly improved efficiency and reduced manual errors.
The new version delivered major performance gains and substantial stability improvements.

5.2 Lessons Learned

Major version upgrades require comprehensive evaluation of architectural changes.
API-based migration is highly suitable for configuration migration, though parameter compatibility must be handled carefully.
Thorough testing and validation are critical to success.
A robust monitoring system is essential for operational stability.
Comprehensive documentation is invaluable for long-term maintenance.

5.3 Future Plans

The current upgrade primarily addressed DolphinScheduler scheduling bottlenecks. To align with upcoming Spark cluster upgrades, the next step will be upgrading SeaTunnel from version 2.1.3 to 2.3.12, most likely using the same migration methodology.
Explore automated testing solutions.
Share migration experience with other teams.

DEV Community