Apache SeaTunnel

Posted on Feb 6

Worried about the cost of migrating from DataX to Apache SeaTunnel? A step-by-step guide for a smooth transfer

#data #seatunnel #datascience #opensource

Many teams using DataX face high maintenance costs and limited scalability, yet hesitate to migrate because of the expected effort. This article starts from the needs of DataX users and shows how to get started with Apache SeaTunnel. It covers how the two tools work, a side-by-side configuration comparison, and an automated conversion tool, so that DataX tasks can be migrated with little manual rework.

1. Automation Tool: X2SeaTunnel

To simplify migration, the SeaTunnel community provides an automated configuration conversion tool, X2SeaTunnel. It converts DataX JSON configs into SeaTunnel config files with a single command.

1.1 Tool Overview

X2SeaTunnel is part of the seatunnel-tools project, designed to help users migrate from other data integration platforms to SeaTunnel quickly.

✅ Standard Config Conversion: DataX JSON → SeaTunnel Config in one step.
✅ Custom Templates: Supports user-defined templates for special requirements.
✅ Batch Conversion: Converts all configs in a folder and generates a migration report automatically.
✅ Detailed Report: Markdown report with field mapping stats and potential warnings.

1.2 Quick Start

1.2.1 Download & Install
Download from GitHub Releases or build from source:

# Build from source
git clone https://github.com/apache/seatunnel-tools.git
cd seatunnel-tools
mvn clean package -pl x2seatunnel -DskipTests
# The compiled package is located at x2seatunnel/target/x2seatunnel-*.zip

1.2.2 Conversion Example

# Convert datax.json to seatunnel.conf
./bin/x2seatunnel.sh \
    -s examples/source/datax-mysql2hdfs.json \
    -t examples/target/mysql2hdfs-result.conf \
    -r examples/report/mysql2hdfs-report.md

1.2.3 View Report
After conversion, check the Markdown report for detailed field mapping and warnings.
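
The single-file command in 1.2.2 extends naturally to a directory of DataX jobs. The loop below is only a sketch: it reuses the -s, -t, and -r flags shown above, while the function name convert_all, the X2ST variable, and the directory arguments are assumptions made for illustration. Where the tool's own batch conversion mode is available, it is likely the better choice.

```shell
#!/bin/sh
# Sketch: convert every DataX JSON file in a directory, one report per job.
# X2ST and the three directory arguments are illustrative assumptions.
X2ST="${X2ST:-./bin/x2seatunnel.sh}"

convert_all() {
    src_dir="$1"; out_dir="$2"; report_dir="$3"
    mkdir -p "$out_dir" "$report_dir"
    failed=0
    for job in "$src_dir"/*.json; do
        [ -e "$job" ] || continue              # directory had no .json files
        name=$(basename "$job" .json)
        # Same flags as the single-file example: source, target, report.
        if ! "$X2ST" -s "$job" -t "$out_dir/$name.conf" -r "$report_dir/$name.md"; then
            echo "conversion failed: $job" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]                        # non-zero status if any job failed
}

# Usage (paths are examples):
# convert_all examples/source examples/target examples/report
```

Keeping one report per job makes it easier to review warnings for each converted config individually.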

2. Deep Dive: Tool Principles Comparison

2.1 DataX Principles

DataX is Alibaba’s open-source offline data sync tool with a Framework + Plugin architecture.

Execution Mode: Single-machine multithreading (Standalone), limited by JVM memory & CPU.
Core Model: Reader → Channel → Writer.
Pros/Cons:
- ✅ Easy to use, rich plugin ecosystem, suitable for small offline sync.
- ❌ Single-node bottleneck: Hard to scale for massive data.
- ❌ No fault tolerance: Failed tasks usually require full rerun, no checkpoint support.
- ❌ Weak real-time support: Mainly designed for batch processing.

2.2 SeaTunnel Principles

Apache SeaTunnel is a next-gen, high-performance, distributed data integration framework.

Execution Mode: Distributed cluster; supports the Zeta, Flink, and Spark engines.
Core Model: Source → Transform → Sink.
Pros:
- ✅ Distributed execution: Tasks can be split into multiple SubTasks for parallel execution, so throughput scales with cluster size.
- ✅ CDC support: Native support for MySQL, PostgreSQL, MongoDB CDC real-time sync.
- ✅ Checkpoint/Resume: A checkpoint mechanism based on the Chandy-Lamport algorithm lets a job resume from its last checkpoint; exactly-once delivery also depends on the connectors involved supporting it.
- ✅ Multi-engine support: The same job config can run on Zeta, Flink, or Spark.

Feature	DataX	SeaTunnel
Architecture	Standalone	Distributed
Config Format	JSON	HOCON (JSON-compatible, supports comments)
Real-time / CDC	Weak	Native support
Fault Tolerance	Full rerun on failure	Checkpoint & resume
Transform Capabilities	Limited	Powerful (SQL, Filter, Split, Replace, etc.)

3. Typical Case: MySQL Migration

This section shows how a typical DataX MySQL→MySQL task maps to SeaTunnel by placing the two configs side by side.

3.1 DataX Job Config (job.json)

{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "root",
                        "column": ["id", "name", "age"],
                        "connection": [{
                            "table": ["source_table"],
                            "jdbcUrl": ["jdbc:mysql://localhost:3306/source_db"]
                        }]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "root",
                        "password": "root",
                        "column": ["id", "name", "age"],
                        "connection": [{
                            "table": ["target_table"],
                            "jdbcUrl": ["jdbc:mysql://localhost:3306/target_db"]
                        }]
                    }
                }
            }
        ]
    }
}

3.2 SeaTunnel Job Config (mysql_to_mysql.conf)

env {
  execution.parallelism = 1
  job.mode = "BATCH"
}

source {
  Jdbc {
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://localhost:3306/source_db"
    user = "root"
    password = "root"
    query = "select id, name, age from source_table"
    result_table_name = "mysql_source"
  }
}

sink {
  Jdbc {
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://localhost:3306/target_db"
    user = "root"
    password = "root"
    source_table_name = "mysql_source"
    query = "insert into target_table (id, name, age) values (?, ?, ?)"
  }
}

3.3 Key Mapping

Module	DataX	SeaTunnel	Description
Global	`job.setting.speed.channel`	`env.execution.parallelism`	Task concurrency.
Reader/Source	`reader.name`	plugin block under `source`	Plugin mapping (`mysqlreader` → `Jdbc`).
	`parameter.jdbcUrl`	`url`	Database URL.
	`parameter.username`	`user`	DB username.
	`parameter.column` + `table`	`query`	SeaTunnel uses SQL directly.
	(none)	`result_table_name`	Virtual table name output by Source.
Writer/Sink	`writer.name`	plugin block under `sink`	Plugin mapping (`mysqlwriter` → `Jdbc`).
	`parameter.writeMode`	`query`	The SQL statement controls insert/upsert behavior.
	`parameter.preSql/postSql`	`pre_sql/post_sql`	SQL hooks supported.
	(none)	`source_table_name`	Must match the Source’s `result_table_name`.
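
The writeMode row is the least obvious part of the mapping, so a concrete case may help. DataX's mysqlwriter expresses update-on-conflict through writeMode, whereas in the SeaTunnel config from 3.2 that behavior would be written into the sink's SQL. The sketch below saves an upsert variant of that job, assuming MySQL's ON DUPLICATE KEY UPDATE syntax and a primary key on id; the file name mysql_to_mysql_upsert.conf is a hypothetical choice for this example.

```shell
#!/bin/sh
# Sketch: write an upsert variant of the job from section 3.2.
# Assumes target_table has a primary key on id; the file name is hypothetical.
mkdir -p config
cat > config/mysql_to_mysql_upsert.conf <<'EOF'
env {
  execution.parallelism = 1
  job.mode = "BATCH"
}

source {
  Jdbc {
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://localhost:3306/source_db"
    user = "root"
    password = "root"
    query = "select id, name, age from source_table"
    result_table_name = "mysql_source"
  }
}

sink {
  Jdbc {
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://localhost:3306/target_db"
    user = "root"
    password = "root"
    source_table_name = "mysql_source"
    # MySQL upsert: update name and age when the id already exists.
    query = "insert into target_table (id, name, age) values (?, ?, ?) on duplicate key update name = values(name), age = values(age)"
  }
}
EOF
```

Only the sink's query differs from the insert-only job, which is what the table means by the SQL statement controlling write behavior.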

4. Running the MySQL Migration Task

Save the config as config/mysql_to_mysql.conf.

# Local development mode
./bin/seatunnel.sh --config ./config/mysql_to_mysql.conf -e local

# Cluster production mode (Zeta Engine)
./bin/seatunnel.sh --config ./config/mysql_to_mysql.conf -e cluster

Check the logs and verify that the target table’s content matches the source.
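
One quick way to do that verification is to compare row counts on both sides. The following is a minimal sketch using the standard mysql command-line client; the function names count_rows and verify_counts are illustrative, the host, user, database, and table names simply mirror the example configs, and the password is assumed to come from a client option file rather than the command line.

```shell
#!/bin/sh
# Sketch: compare row counts between source and target after a batch run.
# Host, user, and table names mirror the article's example configs.
count_rows() {
    # $1 = database, $2 = table
    # -N drops the column header, -B gives plain batch output.
    mysql -h localhost -P 3306 -u root -N -B \
        -e "SELECT COUNT(*) FROM \`$2\`" "$1"
}

verify_counts() {
    src=$(count_rows source_db source_table) || return 2
    dst=$(count_rows target_db target_table) || return 2
    echo "source=$src target=$dst"
    [ "$src" = "$dst" ]                 # status 0 only when the counts agree
}

# Usage:
# verify_counts && echo "row counts match"
```

Matching counts do not prove the rows are identical, so for important tables a checksum or sampled comparison of column values is a sensible follow-up.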

5. Advanced Feature: MySQL CDC

5.1 Why SeaTunnel CDC?

DataX only supports offline batch sync. SeaTunnel CDC supports:

Checkpoint/resume: restart without data loss.
Dynamic table addition: no restart needed.
Lock-free reads: minimal impact on source.

5.2 MySQL CDC Config (mysql_cdc.conf)

env {
  job.mode = "STREAMING"
  checkpoint.interval = 5000
}

source {
  MySQL-CDC {
    result_table_name = "mysql_cdc_source"
    base-url = "jdbc:mysql://localhost:3306/source_db"
    username = "root"
    password = "root"
    table-names = ["source_db.source_table"]
    startup.mode = "initial"
  }
}

sink {
  Jdbc {
    source_table_name = "mysql_cdc_source"
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://localhost:3306/target_db"
    user = "root"
    password = "root"
    generate_sink_sql = true
    primary_keys = ["id"]
    database = "target_db"
    table = "target_table"
  }
}
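
Binlog-based CDC generally needs the source MySQL server to have the binary log enabled in ROW format, so it can be worth checking this before the first run. The sketch below does that with the mysql command-line client and then shows, as a comment, how the job could be started with the same launcher used in section 4. The function name binlog_ready and the connection details are assumptions for illustration; consult the connector documentation for the full list of prerequisites.

```shell
#!/bin/sh
# Sketch: confirm row-based binlog is enabled on the source server
# before starting the streaming CDC job. Connection details are assumptions.
binlog_ready() {
    vars=$(mysql -h localhost -P 3306 -u root -N -B \
        -e "SHOW VARIABLES WHERE Variable_name IN ('log_bin', 'binlog_format')") || return 2
    # Output is one "name<TAB>value" pair per line.
    echo "$vars" | grep -qi '^log_bin[[:space:]]*ON$' \
        || { echo "binlog is disabled" >&2; return 1; }
    echo "$vars" | grep -qi '^binlog_format[[:space:]]*ROW$' \
        || { echo "binlog_format is not ROW" >&2; return 1; }
}

# Usage, assuming the config above is saved as config/mysql_cdc.conf:
# binlog_ready && ./bin/seatunnel.sh --config ./config/mysql_cdc.conf -e local
```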

Summary

Migrating from DataX to Apache SeaTunnel is straightforward. Clear configs and automated tools like X2SeaTunnel make the process fast and smooth. SeaTunnel also brings better performance, scalability, and advanced features like CDC for modern data pipelines.
