1. Background: Why Perform a Major Version Upgrade?
Existing Environment
Current DolphinScheduler Version: 3.1.3
Current SeaTunnel Version: 2.1.3
Deployment Scale:
1 Master + 2 Workers, with over 3,700 workflow definitions and more than 20,000 scheduled tasks executed daily.
Years in Production:
The system has been running stably for over three years.
Drivers Behind the Upgrade
Functional Requirements:
With growing business demands, the limitations of the current version became increasingly apparent, including architectural design constraints, metadata database processing bottlenecks, and insufficient server resources.
Community Support:
The official community recommends adopting the latest stable release to obtain better technical support.
Performance Optimization:
Version 3.4.1 delivers significant improvements in scheduling performance and system stability.
Why We Did Not Use the Official Upgrade Path
Large Version Gap:
The upgrade path would require multiple intermediate upgrades: 3.1.3 → 3.2.0 → 3.3.0 → 3.4.1.
Production Environment Constraints:
Multiple maintenance windows were unacceptable due to strict business continuity requirements.
Architectural Change Risks:
Risks included resource center refactoring, metadata schema changes, and compatibility issues.
Workload Considerations:
With thousands of workflows, manually rebuilding tasks would require enormous effort, making automation essential.
2. Overall Migration Strategy: Bypassing the Official Upgrade Path with a “Rebuild + API” Approach
2.1 Core Idea
Instead of pursuing an “in-place incremental upgrade,” we adopted a “new environment + data migration” strategy:
- The old 3.1.3 cluster continued running without impacting production workloads.
- A brand-new 3.4.1 cluster was deployed to ensure a clean architecture.
- Custom scripts were developed to retrieve workflow definitions and task configurations from the old-version APIs.
- New-version APIs were used to batch-create workflows in the new cluster.
- Business scheduling traffic was gradually switched to the new environment.
2.2 Comparison of Advantages and Risks
| Dimension | Official Upgrade Approach | Rebuild + API Approach |
|---|---|---|
| Downtime | Multiple upgrades with cumulative downtime potentially lasting hours or even days | Nearly seamless cutover by stopping old schedules and enabling new ones |
| Rollback Difficulty | Difficult, as the database schema has already changed | Easy, since the old environment remains intact |
| Data Consistency | Requires validation of all schema migrations | Only core business data (workflow definitions) is migrated; historical execution records are excluded |
| Version Compatibility | Must handle compatibility issues across all intermediate versions | Directly adapts to 3.4.1 with only necessary parameter transformations |
| Workload | Requires repeated validation cycles | Effort mainly concentrated on script development |
| Suitable Scenarios | Minor version upgrades | Major version jumps and large-scale task migration |
3. Detailed Implementation Steps
3.1 Environment Preparation
3.1.1 Deploying the New Environment
- Deploy a brand-new DolphinScheduler 3.4.1 cluster.
- Configure dependencies such as the database and the registry. ZooKeeper was retired in this deployment and replaced with the JDBC Registry.
- Configure required components such as DataX and SeaTunnel.
- Understand key changes in the new version, including:
  - The SeaTunnel 2.1.3 integration's startup mode changed from Spark engine execution to `start-seatunnel-spark.sh`;
  - Default configurations such as tenants, worker groups, and environments are now managed through project preferences;
  - Parameter passing behavior changed: downstream tasks must explicitly define `IN` parameters to receive upstream variable values.
- Verify the basic functionality of the new environment and manually validate representative workflows.
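To make the parameter-passing change concrete, here is a minimal sketch of declaring an `IN` parameter on a downstream task. It models `localParams` as a plain list of maps rather than a JSON library type, and the class and method names (`InParamHelper`, `ensureInParam`) are hypothetical. The field names `prop`, `direct`, `type`, and `value` follow the migration script shown later in this article. The duplicate check is an addition of this sketch, not part of the original script.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class InParamHelper {

    // Adds an IN parameter named prop to a task's localParams unless one is
    // already declared, so re-running the migration does not create duplicates.
    // Returns true when a parameter was added.
    static boolean ensureInParam(List<Map<String, String>> localParams, String prop) {
        for (Map<String, String> p : localParams) {
            if (prop.equals(p.get("prop")) && "IN".equals(p.get("direct"))) {
                return false; // already declared
            }
        }
        Map<String, String> param = new LinkedHashMap<>();
        param.put("prop", prop);
        param.put("direct", "IN");
        param.put("type", "VARCHAR");
        param.put("value", ""); // the value is supplied by the upstream task at run time
        localParams.add(param);
        return true;
    }
}
```

Calling `ensureInParam(localParams, "hiveAmount")` twice on the same list leaves a single `hiveAmount` entry, which keeps repeated migration runs safe.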
3.1.2 API Access Configuration
- Configure API access permissions in the new environment by creating new tokens in token management.
- Obtain administrator tokens for API calls.
- Verify API connectivity.
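A connectivity check can be as small as one authenticated GET request. The sketch below uses the JDK's `java.net.http` client (Java 11+) and passes the access token in the `token` request header. The endpoint path `/dolphinscheduler/projects/list` and the class name `ApiConnectivityCheck` are assumptions chosen for illustration; any lightweight read-only endpoint should serve the same purpose.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class ApiConnectivityCheck {

    // Builds a GET request that carries the access token in the "token" header.
    // The endpoint path is an assumption made for this sketch.
    static HttpRequest buildRequest(String baseUrl, String token) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/dolphinscheduler/projects/list"))
                .timeout(Duration.ofSeconds(10))
                .header("token", token)
                .GET()
                .build();
    }

    // Sends the request and reports whether the server answered with HTTP 200.
    static boolean isReachable(String baseUrl, String token) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(baseUrl, token), HttpResponse.BodyHandlers.ofString());
        return response.statusCode() == 200;
    }
}
```

An HTTP 200 only shows that the API server is reachable. It is worth also inspecting the response body to confirm that the token was accepted before starting a batch migration.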
3.2 Metadata Database Initialization
- Replicating Base Tables: Rebuild foundational metadata tables in the new metadata database according to the old-version configuration, including preserving IDs wherever possible. This significantly reduced modification complexity during script-based workflow restoration.
| Verification Item | Table |
|---|---|
| Tenant Table | t_ds_tenant |
| Project Table | t_ds_project |
| User Table | t_ds_user |
| Environment Tables | t_ds_environment, t_ds_environment_worker_group_relation |
| Worker Group Table | t_ds_worker_group |
| Data Source Table | t_ds_datasource |
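Because the migration scripts rely on IDs matching between the two metadata databases, it may help to verify each base table after replication. The following sketch compares `id -> name` mappings exported from the same table in both databases and lists any differences. How the rows are loaded (JDBC, CSV export, and so on) is left out, and the class name `BaseTableIdCheck` and the sample tenant names other than `omm` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BaseTableIdCheck {

    // Compares id -> name mappings taken from the same base table in the old and
    // new metadata databases and returns a readable list of differences.
    static List<String> diff(String table, Map<Integer, String> oldRows, Map<Integer, String> newRows) {
        List<String> problems = new ArrayList<>();
        // A TreeMap copy gives a stable report ordered by id.
        for (Map.Entry<Integer, String> e : new TreeMap<>(oldRows).entrySet()) {
            String newName = newRows.get(e.getKey());
            if (newName == null) {
                problems.add(table + ": id " + e.getKey() + " (" + e.getValue()
                        + ") missing in new database");
            } else if (!newName.equals(e.getValue())) {
                problems.add(table + ": id " + e.getKey() + " maps to '" + newName
                        + "' instead of '" + e.getValue() + "'");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Hypothetical sample data for illustration only.
        Map<Integer, String> oldTenants = Map.of(1, "omm", 2, "etl");
        Map<Integer, String> newTenants = Map.of(1, "omm", 2, "report");
        diff("t_ds_tenant", oldTenants, newTenants).forEach(System.out::println);
    }
}
```

An empty result for every table in the list above suggests that IDs referenced by migrated workflow definitions will resolve to the same objects in the new cluster.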
3.3 Migration Script Development
3.3.1 Preliminary Preparation and Testing
- Categorize workflows into template-based and non-template-based tasks.
- Select representative workflows and execute them in the new environment to verify successful execution and data synchronization accuracy.
3.3.2 Code Development — Reading Original Workflow Definitions
...
// Retrieve all workflows and process them iteratively
String processDefinitionUrl = OLD_URL + "/dolphinscheduler/projects/" + oldProjectCode
        + "/process-definition/query-process-definition-list";
Map<String, String> map = new HashMap<>();
map.put("projectCode", oldProjectCode);
String pdRes = httpClientUtilOld.doGetRequest(processDefinitionUrl, map);
ArrayList<JSONObject> dataList = parseResDataToList(pdRes);
for (JSONObject job : dataList) {
    String oldWFCode = job.getString("code");
    // Fetch the full definition of this workflow from the old cluster
    Map<String, String> mapPara = new HashMap<>();
    String oldurl = OLD_URL + "/dolphinscheduler/projects/" + oldProjectCode
            + "/process-definition/" + oldWFCode;
    mapPara.put("code", oldWFCode);
    mapPara.put("projectCode", oldProjectCode);
    String res = httpClientUtilOld.doGetRequest(oldurl, mapPara);
    JSONObject jsonObject = JSON.parseObject(res);
    JSONObject data = jsonObject.getJSONObject("data");
    JSONObject processDefinition = data.getJSONObject("processDefinition");
    JSONArray processTaskRelationList = data.getJSONArray("processTaskRelationList");
    JSONArray taskDefinitionList = data.getJSONArray("taskDefinitionList");
    // TODO: generate new task codes and replace old ones
    // Populate workflow information and create workflows
    createWF(processDefinition, processTaskRelationList, taskDefinitionList, NEW_IP, newProjectCode);
}
3.3.3 Creating Workflows via API
// Generate task codes based on task count
int taskCnt = taskDefinitionList.size();
List<String> taskCodeList = taskDefinitionList.stream()
        .map(obj -> (JSONObject) obj)
        .map(obj -> obj.getString("code"))
        .collect(Collectors.toList());
try {
    // Request one new task code per task from the new cluster
    String taskCodeUrl = NEW_URL + "/dolphinscheduler/projects/" + newProjectCode
            + "/task-definition/gen-task-codes";
    HashMap<String, String> taskCodeMap = new HashMap<>();
    // Generate n task codes
    taskCodeMap.put("genNum", String.valueOf(taskCnt));
    String codeData = httpClientUtilNew.doGetRequest(taskCodeUrl, taskCodeMap);
    Object codes = JSON.parseObject(codeData).get("data");
    JSONArray taskCodeArr = JSON.parseArray(codes.toString());
    // Add downstream input parameters based on actual task requirements
    for (int i = 0; i < taskDefinitionList.size(); i++) {
        JSONObject logTask = (JSONObject) taskDefinitionList.get(i);
        // needsInParam is a placeholder for the condition logic that selects
        // the tasks requiring the extra IN parameter
        if (needsInParam(logTask)) {
            JSONObject taskParams = logTask.getJSONObject("taskParams");
            JSONArray localParams = taskParams.getJSONArray("localParams");
            JSONObject hiveParam = new JSONObject();
            hiveParam.put("prop", "hiveAmount");
            hiveParam.put("direct", "IN");
            hiveParam.put("type", "VARCHAR");
            hiveParam.put("value", "");
            localParams.add(hiveParam);
            logTask.put("taskParamList", localParams);
            JSONObject paramMap = new JSONObject();
            for (Object obj : localParams) {
                JSONObject param = (JSONObject) obj;
                paramMap.put(param.getString("prop"), param.getString("value"));
            }
            logTask.put("taskParamMap", paramMap);
            ....
    // Replace required parameters such as task code and SeaTunnel execution engine
    for (int i = 0; i < taskCodeList.size(); i++) {
        String oldCode = taskCodeList.get(i);
        String newCode = taskCodeArr.getString(i);
        // Replace task code
        // Replace SeaTunnel engine: "SPARK" -> "start-seatunnel-spark.sh"
        taskDefinitionListJsonStr = taskDefinitionListJsonStr
                .replace("\"code\":" + oldCode + ",", "\"code\":" + newCode + ",")
                .replace("\"engine\":\"SPARK\",", "\"startupScript\":\"start-seatunnel-spark.sh\",");
        taskRelationListJsonStr = taskRelationListJsonStr
                .replace("TaskCode\":" + oldCode + ",", "TaskCode\":" + newCode + ",");
        locationsJsonStr = locationsJsonStr
                .replace(oldCode, newCode);
        ...
    }
    // Assemble the request parameters and create the workflow in the new cluster
    Map<String, String> map = new HashMap<>();
    map.put("taskDefinitionJson", taskDefinitionListJsonStr);
    map.put("taskRelationJson", taskRelationListJsonStr);
    map.put("locations", locationsJsonStr);
    map.put("name", processDefinition.getString("name"));
    map.put("tenantCode", "omm");
    map.put("executionType", processDefinition.getString("executionType"));
    map.put("description",
            processDefinition.getString("description") == null ? "" : processDefinition.getString("description"));
    map.put("globalParams", processDefinition.getString("globalParams"));
    map.put("timeout", processDefinition.getString("timeout"));
    String processDefinitionUrl = NEW_URL + "/dolphinscheduler/projects/" + newProjectCode
            + "/workflow-definition";
    String processDefinitionRes = httpClientUtilNew.doPostRequest(processDefinitionUrl, map);
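The chained `String.replace` calls above are simple and worked for this migration. One thing to keep in mind is that a plain `replace(oldCode, newCode)` also rewrites any longer number that happens to contain an old code as a substring. As a possible alternative (not the approach used in the original script), the sketch below swaps all codes in a single regex pass that only matches whole numbers. The class name `TaskCodeRemapper` and the sample JSON in the usage note are hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TaskCodeRemapper {

    // Matches a run of digits that is not part of a longer digit run.
    private static final Pattern NUMBER = Pattern.compile("(?<!\\d)\\d+(?!\\d)");

    // Replaces every whole number in the text that appears as a key in oldToNew.
    // All codes are swapped in one pass, so a freshly written new code is never
    // matched again by a later substitution.
    static String remap(String json, Map<String, String> oldToNew) {
        Matcher m = NUMBER.matcher(json);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = oldToNew.getOrDefault(m.group(), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, `remap("[{\"taskCode\":111,\"x\":20}]", Map.of("111", "900"))` changes only the `111` value and leaves `20` untouched. The sketch still replaces matching numbers that appear inside string values such as embedded SQL, so it is best applied to fields that are known to hold only codes and coordinates.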
3.4 Migration Execution
3.4.1 Migration Procedure
Back Up Old Scheduling Tables:
Example: t_ds_schedules_20260416_10
Select a Pilot Project:
Choose a project with moderate workload and limited business impact.
Migrate Workflow Definitions:
Migrate approximately 200 workflow definitions, including scheduling configurations.
Deploy Workflows Without Enabling Schedules:
Deploy workflows first without activating schedules.
Manual Validation:
Execute workflows manually in batches and check for conflicts with the original cluster. Since most workflows run daily, hourly, or every 15 minutes, conflicts were minimal.
Investigate Failed Tasks:
Analyze root causes, fix issues, and rerun failed workflows.
Enable Scheduling Configurations:
Enable schedules after all workflows pass validation.
Disable Old Cluster Scheduling:
After confirming stable operation in the new environment, disable corresponding schedules in the old cluster.
Migrate Project by Project, Batch by Batch
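The backup step at the start of this procedure can be scripted so that every batch gets a consistently named copy. The sketch below derives a backup table name and a copy statement from a timestamp. It assumes the example name `t_ds_schedules_20260416_10` follows a `<table>_<yyyyMMdd>_<HH>` pattern, which is an inference from that one example, and the class name `ScheduleBackup` is hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class ScheduleBackup {

    // Assumed naming pattern: <table>_<yyyyMMdd>_<HH>, inferred from the example name.
    private static final DateTimeFormatter SUFFIX = DateTimeFormatter.ofPattern("yyyyMMdd_HH");

    // Returns the backup table name for the given source table and timestamp.
    static String backupTableName(String table, LocalDateTime when) {
        return table + "_" + when.format(SUFFIX);
    }

    // Builds a copy statement. CREATE TABLE ... AS SELECT copies the rows but not
    // indexes or constraints, which is typically sufficient for a point-in-time backup.
    static String backupSql(String table, LocalDateTime when) {
        return "CREATE TABLE " + backupTableName(table, when) + " AS SELECT * FROM " + table;
    }
}
```

Running the generated statement against the old metadata database before each batch gives a per-batch restore point for the schedule configuration.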
3.4.2 Migration Execution Results
Following this process, we first selected a representative project containing 199 workflows. After migration, it was tested in production for one week without issues.
Subsequently, we completed migration for 50 projects, totaling approximately 3,700 workflows, within about 10 days.
- Tracking Table
| No. | Project Name | Project Code | Workflow Count | Progress | Remarks |
|---|---|---|---|---|---|
| 1 | Project 1 | 13******* | 199 | Completed on 04/16 | One-week stability test completed |
| 2 | Project 2 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| 50 | ... | ... | ... | ... | ... |
3.4.3 Runtime Status
The new cluster has now been running for nearly one month.
Previously, scheduling delays ranged from 10 seconds to over one minute in severe cases. After the upgrade, scheduling latency has been virtually eliminated.
Missing scheduling instances, which occurred in the old cluster, have not recurred.
So far, the system has been running smoothly without any identified problems, and continuous monitoring remains in place.
4. Risk Control and Contingency Planning
4.1 Major Risks
Risk of Data Loss:
Some configurations could potentially be missed during migration.
Compatibility Issues:
Certain configurations supported in the old version may not be supported in the new version.
Business Interruption Risks:
Scheduling delays could occur during the switchover process.
4.2 Contingency Plans
4.2.1 Rollback Strategy
- Immediately stop scheduling in the new environment.
- Restore scheduling services in the old environment.
- Analyze root causes and retry after issue resolution.
4.2.2 Data Backup
- Perform a complete backup of the old environment database.
- Back up the initial configuration of the new environment.
5. Conclusion
5.1 Project Outcomes
- Successfully completed a cross-version upgrade with zero business interruption.
- Automated migration scripts significantly improved efficiency and reduced manual errors.
- The new version delivered major performance gains and substantial stability improvements.
5.2 Lessons Learned
- Major version upgrades require comprehensive evaluation of architectural changes.
- API-based migration is highly suitable for configuration migration, though parameter compatibility must be handled carefully.
- Thorough testing and validation are critical to success.
- A robust monitoring system is essential for operational stability.
- Comprehensive documentation is invaluable for long-term maintenance.
5.3 Future Plans
- The current upgrade primarily addressed DolphinScheduler scheduling bottlenecks. To align with upcoming Spark cluster upgrades, the next step will be upgrading SeaTunnel from version 2.1.3 to 2.3.12, most likely using the same migration methodology.
- Explore automated testing solutions.
- Share migration experience with other teams.
