DEV Community

Apache SeaTunnel

Worried about the cost of migrating from DataX to Apache SeaTunnel? A step-by-step guide for a smooth transfer

Many teams running DataX face high maintenance costs and limited scalability, yet hesitate to migrate because of the expected overhead. This article starts from the needs of DataX users and shows how to get started with Apache SeaTunnel quickly. It covers an automated conversion tool, a comparison of how the two tools work, and a side-by-side configuration example, so you can migrate DataX tasks to SeaTunnel with little manual effort.

1. Automation Tool: X2SeaTunnel

To simplify migration, the SeaTunnel community provides an automated configuration conversion tool, X2SeaTunnel. It converts a DataX JSON config into a SeaTunnel config file with a single command.

1.1 Tool Overview

X2SeaTunnel is part of the seatunnel-tools project, designed to help users migrate from other data integration platforms to SeaTunnel quickly.

  • Standard Config Conversion: DataX JSON → SeaTunnel Config in one step.
  • Custom Templates: Supports user-defined templates for special requirements.
  • Batch Conversion: Converts all configs in a folder and generates a migration report automatically.
  • Detailed Report: Markdown report with field mapping stats and potential warnings.

1.2 Quick Start

1.2.1 Download & Install
Download from GitHub Releases or build from source:

# Build from source
git clone https://github.com/apache/seatunnel-tools.git
cd seatunnel-tools
mvn clean package -pl x2seatunnel -DskipTests
# The compiled package is located at x2seatunnel/target/x2seatunnel-*.zip

1.2.2 Conversion Example

# Convert datax.json to seatunnel.conf
./bin/x2seatunnel.sh \
    -s examples/source/datax-mysql2hdfs.json \
    -t examples/target/mysql2hdfs-result.conf \
    -r examples/report/mysql2hdfs-report.md

1.2.3 View Report
After conversion, check the Markdown report for detailed field mapping and warnings.
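
For a folder of jobs, X2SeaTunnel's built-in batch conversion is the natural choice. Its flags are not shown in this article, so the sketch below sticks to the single-file flags from the example above (-s, -t, -r) and simply loops over the files. The function names, directory layout, and output naming scheme are assumptions made for illustration.

```python
from pathlib import Path
import subprocess


def build_commands(job_files, target_dir, report_dir, script="./bin/x2seatunnel.sh"):
    """Return one x2seatunnel argv list per DataX JSON file, in sorted order."""
    commands = []
    for job in sorted(Path(p) for p in job_files):
        stem = job.stem  # e.g. "datax-mysql2hdfs" for "datax-mysql2hdfs.json"
        commands.append([
            script,
            "-s", str(job),                                      # DataX JSON input
            "-t", str(Path(target_dir) / f"{stem}.conf"),        # SeaTunnel config output
            "-r", str(Path(report_dir) / f"{stem}-report.md"),   # Markdown report
        ])
    return commands


def convert_all(source_dir, target_dir, report_dir):
    """Run the converter once per *.json file; stops at the first failing job."""
    jobs = Path(source_dir).glob("*.json")
    for cmd in build_commands(jobs, target_dir, report_dir):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
```

Calling convert_all("examples/source", "examples/target", "examples/report") from the tool's install directory would then produce one config and one report per job. Keeping command construction separate from execution makes it easy to print the commands first as a dry run.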

2. Deep Dive: Tool Principles Comparison

2.1 DataX Principles

DataX is Alibaba’s open-source offline data sync tool with a Framework + Plugin architecture.

  • Execution Mode: Single-machine multithreading (Standalone), limited by JVM memory & CPU.
  • Core Model: Reader → Channel → Writer.
  • Pros/Cons:

    • ✅ Easy to use, with a rich plugin ecosystem; well suited to small offline sync jobs.
    • ❌ Single-node bottleneck: hard to scale to very large data volumes.
    • ❌ No fault tolerance: there is no checkpoint support, so a failed task usually has to be rerun in full.
    • ❌ Weak real-time support: designed mainly for batch processing.
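
To make the Reader → Channel → Writer model concrete, here is a toy version in plain Python: one reader thread, one writer thread, and a bounded queue standing in for the channel. This is not DataX code, and every name in it is invented for the illustration. It only shows why a single process with a fixed number of channels is bounded by one machine's resources.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def reader(rows, channel):
    """Push every source row into the channel, then signal completion."""
    for row in rows:
        channel.put(row)  # blocks while the channel is full (back-pressure)
    channel.put(SENTINEL)


def writer(channel, sink):
    """Drain the channel into the sink until the sentinel arrives."""
    while True:
        row = channel.get()
        if row is SENTINEL:
            break
        sink.append(row)


def run_job(rows, capacity=2):
    """Run one reader and one writer connected by a bounded in-memory channel."""
    channel = queue.Queue(maxsize=capacity)
    sink = []
    threads = [
        threading.Thread(target=reader, args=(rows, channel)),
        threading.Thread(target=writer, args=(channel, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

Everything here lives in one process and one heap. Adding channels adds threads, not machines, which is the single-node bottleneck described above.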

2.2 SeaTunnel Principles

Apache SeaTunnel is a next-gen, high-performance, distributed data integration framework.

  • Execution Mode: Distributed cluster, supports Zeta, Flink, Spark engines.
  • Core Model: Source → Transform → Sink.
  • Strengths:

    • Distributed execution: a job is split into multiple SubTasks that run in parallel, so throughput can scale with cluster size.
    • CDC support: native CDC connectors for MySQL, PostgreSQL, and MongoDB enable real-time sync.
    • Checkpoint/Resume: a Chandy-Lamport-based checkpoint mechanism lets a failed job resume from its last checkpoint; end-to-end exactly-once delivery also depends on the sink connector supporting it.
    • Multi-engine support: the same job config can run on Zeta, Flink, or Spark.

| Feature | DataX | SeaTunnel |
| --- | --- | --- |
| Architecture | Standalone | Distributed |
| Config Format | JSON | HOCON (JSON-compatible, supports comments) |
| Real-time / CDC | Weak | Native support |
| Fault Tolerance | Full rerun on failure | Checkpoint & resume |
| Transform Capabilities | Limited | Powerful (SQL, Filter, Split, Replace, etc.) |
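
The "Fault Tolerance" row is easiest to see with numbers. The sketch below is a deliberately simplified, single-process offset checkpoint. It is not the Chandy-Lamport snapshot algorithm and not a SeaTunnel API, and all names are hypothetical. It only contrasts resuming from the last checkpoint with rerunning everything.

```python
class Crash(Exception):
    """Simulated failure; carries the last committed checkpoint offset."""

    def __init__(self, checkpoint):
        super().__init__(f"crashed, last checkpoint at offset {checkpoint}")
        self.checkpoint = checkpoint


def sync(records, sink, start=0, checkpoint_every=3, fail_at=None):
    """Copy records[start:] into sink, checkpointing every N records.

    Returns the number of write operations performed in this run.
    """
    checkpoint = start
    writes = 0
    for offset in range(start, len(records)):
        if offset == fail_at:
            raise Crash(checkpoint)
        row = records[offset]
        sink[row["id"]] = row  # upsert by key, so replayed rows are harmless
        writes += 1
        if (offset + 1) % checkpoint_every == 0:
            checkpoint = offset + 1  # offsets below this are considered durable
    return writes


records = [{"id": i, "name": f"user{i}"} for i in range(10)]
sink = {}
try:
    sync(records, sink, fail_at=8)  # crashes after writing offsets 0..7
except Crash as crash:
    # Resume from the last checkpoint (offset 6) instead of offset 0.
    resumed_writes = sync(records, sink, start=crash.checkpoint)
```

After the crash, the resumed run performs 4 writes (offsets 6 to 9) where a full rerun would perform 10. Offsets 6 and 7 are written twice, which is why the toy sink upserts by key: resuming from a checkpoint replays whatever happened after it, and the sink has to tolerate that.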

3. Typical Case: MySQL Migration

This section shows how a typical DataX MySQL→MySQL task maps to a SeaTunnel job by placing the two configs side by side.

3.1 DataX Job Config (job.json)

{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "root",
                        "column": ["id", "name", "age"],
                        "connection": [{
                            "table": ["source_table"],
                            "jdbcUrl": ["jdbc:mysql://localhost:3306/source_db"]
                        }]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "root",
                        "password": "root",
                        "column": ["id", "name", "age"],
                        "connection": [{
                            "table": ["target_table"],
                            "jdbcUrl": ["jdbc:mysql://localhost:3306/target_db"]
                        }]
                    }
                }
            }
        ]
    }
}

3.2 SeaTunnel Job Config (mysql_to_mysql.conf)

env {
  execution.parallelism = 1
  job.mode = "BATCH"
}

source {
  Jdbc {
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://localhost:3306/source_db"
    user = "root"
    password = "root"
    query = "select id, name, age from source_table"
    result_table_name = "mysql_source"
  }
}

sink {
  Jdbc {
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://localhost:3306/target_db"
    user = "root"
    password = "root"
    source_table_name = "mysql_source"
    query = "insert into target_table (id, name, age) values (?, ?, ?)"
  }
}

3.3 Key Mapping

| Module | DataX | SeaTunnel | Description |
| --- | --- | --- | --- |
| Global | job.setting.speed.channel | env.execution.parallelism | Task concurrency. |
| Reader/Source | reader.name | source.plugin_name | Plugin mapping (Jdbc). |
| | parameter.jdbcUrl | url | Database URL. |
| | parameter.username | user | DB username. |
| | parameter.column + table | query | SeaTunnel uses SQL directly. |
| | (none) | result_table_name | Virtual table name output by Source. |
| Writer/Sink | writer.name | sink.plugin_name | Plugin mapping (Jdbc). |
| | parameter.writeMode | SQL-based | SQL controls insert/upsert behavior. |
| | parameter.preSql/postSql | pre_sql/post_sql | SQL hooks supported. |
| | (none) | source_table_name | Must match Source’s result_table_name. |
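
The mapping above is mechanical enough to express in a few lines of code. The following is a hand-written illustration of the table, not X2SeaTunnel's implementation. It handles only the MySQL→MySQL shape shown in this section with writeMode "insert", hard-codes the MySQL driver class, and does not escape quotes in values. The function and helper names are made up for the example.

```python
import json

MYSQL_DRIVER = "com.mysql.cj.jdbc.Driver"


def first(value):
    """DataX fields such as jdbcUrl may be a string or a list; take the first entry."""
    return value if isinstance(value, str) else value[0]


def jdbc_block(pairs):
    """Render a Jdbc { ... } plugin block from (key, value) pairs."""
    lines = ["  Jdbc {"]
    lines += [f'    {key} = "{value}"' for key, value in pairs]
    lines.append("  }")
    return lines


def datax_to_seatunnel(job_json, table_alias="mysql_source"):
    """Translate a DataX mysqlreader/mysqlwriter job into SeaTunnel config text."""
    job = json.loads(job_json)["job"]
    content = job["content"][0]
    reader = content["reader"]["parameter"]
    writer = content["writer"]["parameter"]
    if writer.get("writeMode", "insert") != "insert":
        raise NotImplementedError("this sketch only handles writeMode=insert")

    r_conn, w_conn = reader["connection"][0], writer["connection"][0]
    # parameter.column + table  ->  query
    select_sql = "select {} from {}".format(
        ", ".join(reader["column"]), first(r_conn["table"]))
    insert_sql = "insert into {} ({}) values ({})".format(
        first(w_conn["table"]),
        ", ".join(writer["column"]),
        ", ".join("?" for _ in writer["column"]),
    )
    channel = job["setting"]["speed"]["channel"]  # -> env.execution.parallelism

    lines = ["env {", f"  execution.parallelism = {channel}",
             '  job.mode = "BATCH"', "}", ""]
    lines += ["source {"] + jdbc_block([
        ("driver", MYSQL_DRIVER),
        ("url", first(r_conn["jdbcUrl"])),
        ("user", reader["username"]),
        ("password", reader["password"]),
        ("query", select_sql),
        ("result_table_name", table_alias),
    ]) + ["}", ""]
    lines += ["sink {"] + jdbc_block([
        ("driver", MYSQL_DRIVER),
        ("url", first(w_conn["jdbcUrl"])),
        ("user", writer["username"]),
        ("password", writer["password"]),
        ("source_table_name", table_alias),  # must match result_table_name
        ("query", insert_sql),
    ]) + ["}"]
    return "\n".join(lines) + "\n"
```

Passing the contents of job.json from section 3.1 to datax_to_seatunnel yields the same settings as the config in section 3.2. For real migrations, prefer X2SeaTunnel, which also produces a report of unmapped fields.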

4. Running the MySQL Migration Task

Save the config as config/mysql_to_mysql.conf.

# Local development mode
./bin/seatunnel.sh --config ./config/mysql_to_mysql.conf -e local

# Cluster production mode (Zeta Engine)
./bin/seatunnel.sh --config ./config/mysql_to_mysql.conf -e cluster

Check the job logs for errors, then verify that the contents of the target table match the source, starting with the row counts.
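
One possible way to do that comparison without relying on row order is to fingerprint each table. The helper below is a sketch under a few assumptions: rows arrive as tuples from whatever database driver you already use, both sides select the same columns in the same order, and both drivers return values of the same Python types. The function name is invented for the example.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, digest) for an iterable of row tuples.

    The digest is the sum of per-row SHA-256 hashes modulo 2**256, so it does
    not depend on the order in which rows are returned.
    """
    count = 0
    combined = 0
    for row in rows:
        row_hash = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        combined = (combined + int.from_bytes(row_hash, "big")) % (1 << 256)
        count += 1
    return count, combined
```

Run "select id, name, age from source_table" against the source and the matching query against target_table, pass each result set to table_fingerprint, and compare the two tuples. A differing count points to missing or duplicated rows; an equal count with a differing digest points to changed values.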

5. Advanced Feature: MySQL CDC

5.1 Why SeaTunnel CDC?

DataX only supports offline batch sync. SeaTunnel CDC supports:

  • Checkpoint/resume: restart without data loss.
  • Dynamic table addition: no restart needed.
  • Lock-free reads: minimal impact on source.

5.2 MySQL CDC Config (mysql_cdc.conf)

env {
  job.mode = "STREAMING"
  checkpoint.interval = 5000
}

source {
  MySQL-CDC {
    result_table_name = "mysql_cdc_source"
    base-url = "jdbc:mysql://localhost:3306/source_db"
    username = "root"
    password = "root"
    table-names = ["source_db.source_table"]
    startup.mode = "initial"
  }
}

sink {
  Jdbc {
    source_table_name = "mysql_cdc_source"
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://localhost:3306/target_db"
    user = "root"
    password = "root"
    generate_sink_sql = true
    primary_keys = ["id"]
    database = "target_db"
    table = "target_table"
  }
}
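
The sink in this config declares primary_keys because a CDC stream carries updates and deletes, not just inserts, and those have to be addressed to a specific row. The sketch below illustrates that idea with an in-memory target. The event shape (op, before, after) is a simplification invented for the example, not SeaTunnel's internal row format, and the function name is hypothetical.

```python
def apply_change(target, event, primary_keys=("id",)):
    """Apply one change event to `target`, a dict keyed by primary-key tuple."""
    op = event["op"]
    if op in ("insert", "update"):
        row = event["after"]
        key = tuple(row[k] for k in primary_keys)
        target[key] = row  # upsert: insert if absent, overwrite if present
    elif op == "delete":
        row = event["before"]
        key = tuple(row[k] for k in primary_keys)
        target.pop(key, None)  # deleting a missing row is a no-op
    else:
        raise ValueError(f"unknown op: {op!r}")


events = [
    {"op": "insert", "after": {"id": 1, "name": "ann", "age": 30}},
    {"op": "insert", "after": {"id": 2, "name": "bob", "age": 41}},
    {"op": "update", "before": {"id": 1, "name": "ann", "age": 30},
     "after": {"id": 1, "name": "ann", "age": 31}},
    {"op": "delete", "before": {"id": 2, "name": "bob", "age": 41}},
]

target = {}
for event in events:
    apply_change(target, event)

# Replaying the tail of the stream, as happens after resuming from a
# checkpoint, leaves the target unchanged because every write is keyed.
for event in events[2:]:
    apply_change(target, event)
```

Because each write is keyed, replaying events after a restart converges to the same state, which is what makes checkpoint-and-resume safe for this kind of sink. The sketch ignores updates that change the primary key itself.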

Summary

Migrating from DataX to Apache SeaTunnel is straightforward. Clear configs and automated tools like X2SeaTunnel make the process fast and smooth. SeaTunnel also brings better performance, scalability, and advanced features like CDC for modern data pipelines.
