DEV Community

Chen Debra
Chen Debra

Posted on

Upgrading DolphinScheduler Across Major Versions: From 3.1.3 to 3.4.1 via API Automation

1. Background: Why Perform a Major Version Upgrade?

Existing Environment

Current DolphinScheduler Version: 3.1.3
Current SeaTunnel Version: 2.1.3

Deployment Scale:
1 Master + 2 Workers, with over 3,700 workflow definitions and more than 20,000 scheduled tasks executed daily.

Years in Production:
The system has been running stably for over three years.

Drivers Behind the Upgrade

  1. Functional Requirements:
    With growing business demands, the limitations of the current version became increasingly apparent, including architectural design constraints, metadata database processing bottlenecks, and insufficient server resources.

  2. Community Support:
    The official community recommends adopting the latest stable release to obtain better technical support.

  3. Performance Optimization:
    Version 3.4.1 delivers significant improvements in scheduling performance and system stability.

Why We Did Not Use the Official Upgrade Path

  1. Large Version Gap:
    The upgrade path would require multiple intermediate upgrades:
    3.1.3 → 3.2.0 → 3.3.0 → 3.4.1.

  2. Production Environment Constraints:
    Multiple maintenance windows were unacceptable due to strict business continuity requirements.

  3. Architectural Change Risks:
    Risks included resource center refactoring, metadata schema changes, and compatibility issues.

  4. Workload Considerations:
    With thousands of workflows, manually rebuilding tasks would require enormous effort, making automation essential.

2. Overall Migration Strategy: Bypassing the Official Upgrade Path with a “Rebuild + API” Approach

2.1 Core Idea

Instead of pursuing an “in-place incremental upgrade,” we adopted a “new environment + data migration” strategy:

  • The old 3.1.3 cluster continued running without impacting production workloads.
  • A brand-new 3.4.1 cluster was deployed to ensure a clean architecture.
  • Custom scripts were developed to retrieve workflow definitions and task configurations from the old-version APIs.
  • New-version APIs were used to batch-create workflows in the new cluster.
  • Business scheduling traffic was gradually switched to the new environment.

2.2 Comparison of Advantages and Risks

Dimension Official Upgrade Approach Rebuild + API Approach
Downtime Multiple upgrades with cumulative downtime potentially lasting hours or even days Nearly seamless cutover by stopping old schedules and enabling new ones
Rollback Difficulty Difficult, as the database schema has already changed Easy, since the old environment remains intact
Data Consistency Requires validation of all schema migrations Only core business data (workflow definitions) is migrated; historical execution records are excluded
Version Compatibility Must handle compatibility issues across all intermediate versions Directly adapts to 3.4.1 with only necessary parameter transformations
Workload Requires repeated validation cycles Effort mainly concentrated on script development
Suitable Scenarios Minor version upgrades Major version jumps and large-scale task migration

3. Detailed Implementation Steps

3.1 Environment Preparation

3.1.1 Deploying the New Environment

  • Deploy a brand-new DolphinScheduler 3.4.1 cluster.
  • Configure dependencies such as the database and Registry. ZooKeeper was deprecated and replaced with JDBC Registry.
  • Configure required components such as DataX and SeaTunnel.
  • Understand key changes in the new version, including:

    • SeaTunnel 2.1.3 integration startup mode changed from Spark engine execution to start-seatunnel-spark.sh;
    • Default configurations such as tenants, worker groups, and environments are now managed through project preferences;
    • Parameter passing behavior changed: downstream tasks must explicitly define IN parameters to receive upstream variable values.
  • Verify the basic functionality of the new environment and manually validate representative workflows.

3.1.2 API Access Configuration

  • Configure API access permissions in the new environment by creating new tokens in token management.
  • Obtain administrator tokens for API calls.
  • Verify API connectivity.

3.2 Metadata Database Initialization

  • Replicating Base Tables: Rebuild foundational metadata tables in the new metadata database according to the old-version configuration, including preserving IDs wherever possible. This significantly reduced modification complexity during script-based workflow restoration.
Verification Item Table
Tenant Table t_ds_tenant
Project Table t_ds_project
User Table t_ds_user
Environment Tables t_ds_environment, t_ds_environment_worker_group_relation
Worker Group Table t_ds_worker_group
Data Source Table t_ds_datasource

3.3 Migration Script Development

3.3.1 Preliminary Preparation and Testing

  1. Categorize workflows into template-based and non-template-based tasks.
  2. Select representative workflows and execute them in the new environment to verify successful execution and data synchronization accuracy.

3.3.2 Code Development — Reading Original Workflow Definitions

...
 // Retrieve all workflows and process them iteratively
        String processDefinitionUrl = OLD_URL + "/dolphinscheduler/projects/" + oldProjectCode +
                "/process-definition/query-process-definition-list";
        Map<String, String> map = new HashMap<>();
        map.put("projectCode", oldProjectCode);
        String pdRes = httpClientUtilOld.doGetRequest(processDefinitionUrl, map);
        ArrayList<JSONObject> dataList = parseResDataToList(pdRes);
        for (JSONObject job : dataList) {
            String oldWFCode = job.get("code").toString();
            Map<String, String> mapPara = new HashMap<>();
            String oldurl = OLD_URL + "/dolphinscheduler/projects/" + oldProjectCode
                    + "/process-definition/" + oldWFCode;
            mapPara.put("code", oldWFCode);
            mapPara.put("projectCode", oldProjectCode);

            String res = httpClientUtilOld.doGetRequest(oldurl, mapPara);
            JSONObject jsonObject = JSON.parseObject(res);
            JSONObject data = (JSONObject) jsonObject.get("data");
            JSONObject processDefinition = data.getJSONObject("processDefinition");
            JSONArray processTaskRelationList = data.getJSONArray("processTaskRelationList");
            JSONArray taskDefinitionList = data.getJSONArray("taskDefinitionList");

            // TODO: generate new task codes and replace old ones
            // Populate workflow information and create workflows
            createWF(processDefinition, processTaskRelationList, taskDefinitionList, NEW_IP, newProjectCode);
Enter fullscreen mode Exit fullscreen mode

3.3.3 Creating Workflows via API

 // Generate task codes based on task count
        int taskCnt = taskDefinitionList.size();
        List<String> taskCodeList = taskDefinitionList.stream()
                .map(obj -> (JSONObject) obj)
                .map(obj -> obj.getString("code"))
                .collect(Collectors.toList());

        try {
// TODO: generate task codes
            String taskCodeUrl = NEW_URL + "/dolphinscheduler/projects/" + newProjectCode + "/task-definition/gen-task-codes";
            HashMap<String, String> taskCodeMap = new HashMap<>();
            // Generate n task codes
            taskCodeMap.put("genNum", String.valueOf(taskCnt));
            String codeData = httpClientUtilNew.doGetRequest(taskCodeUrl, taskCodeMap);
            Object codes = JSON.parseObject(codeData).get("data");
            JSONArray taskCodeArr = JSON.parseArray(codes.toString());

// Add downstream input parameters based on actual task requirements
            for (int i = 0; i < taskDefinitionList.size(); i++) {
                JSONObject logTask = (JSONObject) taskDefinitionList.get(i);
                if (Condition Logic)) {
                    JSONObject taskParams = logTask.getJSONObject("taskParams");
                    JSONArray localParams = taskParams.getJSONArray("localParams");

                    JSONObject hiveParam = new JSONObject();
                    hiveParam.put("prop", "hiveAmount");
                    hiveParam.put("direct", "IN");
                    hiveParam.put("type", "VARCHAR");
                    hiveParam.put("value", "");
                    localParams.add(hiveParam);

                    logTask.put("taskParamList", localParams);

                    JSONObject paramMap = new JSONObject();
                    for (Object obj : localParams) {
                        JSONObject param = (JSONObject) obj;
                        paramMap.put(param.getString("prop"), param.getString("value"));
                    }

                    logTask.put("taskParamMap", paramMap);
                    ....


// Replace required parameters such as task code and SeaTunnel execution engine
 for (int i = 0; i < taskCodeList.size(); i++) {
                String oldCode = taskCodeList.get(i);
                String newCode = taskCodeArr.getString(i);

                // Replace task code
                // Replace SeaTunnel engine: "SPARK" -> "start-seatunnel-spark.sh"
                taskDefinitionListJsonStr = taskDefinitionListJsonStr
                        .replace("\"code\":" + oldCode + ",", "\"code\":" + newCode + ",")
                        .replace("\"engine\":\"SPARK\",", "\"startupScript\":\"start-seatunnel-spark.sh\",");

                taskRelationListJsonStr = taskRelationListJsonStr
                        .replace("TaskCode\":" + oldCode + ",", "TaskCode\":" + newCode + ",");

                locationsJsonStr = locationsJsonStr
                        .replace(oldCode, newCode);

...
  }
            Map<String, String> map = new HashMap<>();
            map.put("taskDefinitionJson", taskDefinitionListJsonStr);
            map.put("taskRelationJson", taskRelationListJsonStr);
            map.put("locations", locationsJsonStr);

            map.put("name", processDefinition.getString("name"));
            map.put("tenantCode", "omm");
            map.put("executionType", processDefinition.getString("executionType"));
            map.put("description",
                    processDefinition.getString("description") == null ? "" : processDefinition.getString("description"));
            map.put("globalParams", processDefinition.getString("globalParams"));
            map.put("timeout", processDefinition.getString("timeout"));

            String processDefinitionUrl = NEW_URL + "/dolphinscheduler/projects/" + newProjectCode + "/workflow-definition";
            String processDefinitionRes = httpClientUtilNew.doPostRequest(processDefinitionUrl, map);
Enter fullscreen mode Exit fullscreen mode

3.4 Migration Execution

3.4.1 Migration Procedure

  1. Back Up Old Scheduling Tables:
    Example: t_ds_schedules_20260416_10

  2. Select a Pilot Project:
    Choose a project with moderate workload and limited business impact.

  3. Migrate Workflow Definitions:
    Migrate approximately 200 workflow definitions, including scheduling configurations.

  4. Deploy Workflows Without Enabling Schedules:
    Deploy workflows first without activating schedules.

  5. Manual Validation:
    Execute workflows manually in batches and verify conflicts with the original cluster. Since most workflows run daily, hourly, or every 15 minutes, conflicts were minimal.

  6. Investigate Failed Tasks:
    Analyze root causes, fix issues, and rerun failed workflows.

  7. Enable Scheduling Configurations:
    Enable schedules after all workflows pass validation.

  8. Disable Old Cluster Scheduling:
    After confirming stable operation in the new environment, disable corresponding schedules in the old cluster.

  9. Migrate Project by Project, Batch by Batch

3.4.2 Migration Execution Results

Following this process, we first selected a representative project containing 199 workflows. After migration, it was tested in production for one week without issues.

Subsequently, we completed migration for 50 projects, totaling approximately 3,700 workflows, within about 10 days.

  • Tracking Table
No. Project Name Project Code Workflow Count Progress Remarks
1 Project 1 13******* 199 Completed on 04/16 One-week stability test completed
2 Project 2 ... ... ... ...
... ... ... ... ... ...
50 ... ... ... ... ...

3.4.3 Runtime Status

The new cluster has now been running for nearly one month.

Previously, scheduling delays ranged from 10 seconds to over one minute in severe cases. After the upgrade, scheduling latency has been virtually eliminated.

Issues related to missing scheduling instances have also not reoccurred.

So far, the system has been running smoothly without any identified problems, and continuous monitoring remains in place.

4. Risk Control and Contingency Planning

4.1 Major Risks

Risk of Data Loss:
Some configurations could potentially be missed during migration.

Compatibility Issues:
Certain configurations supported in the old version may not be supported in the new version.

Business Interruption Risks:
Scheduling delays could occur during the switchover process.

4.2 Contingency Plans

4.2.1 Rollback Strategy

  • Immediately stop scheduling in the new environment.
  • Restore scheduling services in the old environment.
  • Analyze root causes and retry after issue resolution.

4.2.2 Data Backup

  • Perform a complete backup of the old environment database.
  • Back up the initial configuration of the new environment.

5. Conclusion

5.1 Project Outcomes

  • Successfully completed a cross-version upgrade with zero business interruption.
  • Automated migration scripts significantly improved efficiency and reduced manual errors.
  • The new version delivered major performance gains and substantial stability improvements.

5.2 Lessons Learned

  • Major version upgrades require comprehensive evaluation of architectural changes.
  • API-based migration is highly suitable for configuration migration, though parameter compatibility must be handled carefully.
  • Thorough testing and validation are critical to success.
  • A robust monitoring system is essential for operational stability.
  • Comprehensive documentation is invaluable for long-term maintenance.

5.3 Future Plans

  • The current upgrade primarily addressed DolphinScheduler scheduling bottlenecks. To align with upcoming Spark cluster upgrades, the next step will be upgrading SeaTunnel from version 2.1.3 to 2.3.12, most likely using the same migration methodology.
  • Explore automated testing solutions.
  • Share migration experience with other teams.

Top comments (0)