DEV Community

Apache SeaTunnel
Apache SeaTunnel

Posted on

Demo: Full Data Synchronization from MySQL CDC to PostgreSQL with Apache SeaTunnel

This article provides a detailed walkthrough on using Apache SeaTunnel 2.3.9 to perform full data synchronization from MySQL CDC to PostgreSQL.

For the complete demonstration, please refer to the video:

Demo: Synchronizing Data from MySQL CDC to PostgreSQL with Apache SeaTunnel

Without further ado, let's dive into the MySQL-to-PostgreSQL synchronization scenario.

Version Requirements

  • MySQL → MySQL 8.3
  • PostgreSQL → PostgreSQL 13.2
  • Apache SeaTunnel → Apache SeaTunnel 2.3.9

All configuration files used in this article are available by replying with the keyword "Demo 01" on our WeChat official account.

Prerequisites

Verify Version Information

-- Check version information
SELECT VERSION();
Enter fullscreen mode Exit fullscreen mode

Enable Replication

-- Check replication-related configurations
SHOW VARIABLES
WHERE variable_name IN (
'log_bin',
'binlog_format',
'binlog_row_image',
'gtid_mode',
'enforce_gtid_consistency'
);
Enter fullscreen mode Exit fullscreen mode

MySQL CDC synchronization relies on reading MySQL binlog files. SeaTunnel cluster nodes act as replication replicas within the MySQL replication architecture.

Therefore, before configuring CDC synchronization, you must verify that:

  • Binary logging (binlog) is enabled.
  • Replication mode is enabled.

Note: For MySQL 8.0 and later versions, binlog is enabled by default. However, replication-related settings still need to be configured manually.

-- Enable replication (execute sequentially)
-- SET GLOBAL gtid_mode=OFF;
-- SET GLOBAL enforce_gtid_consistency=OFF;

SET GLOBAL gtid_mode=OFF_PERMISSIVE;
SET GLOBAL gtid_mode=ON_PERMISSIVE;
SET GLOBAL enforce_gtid_consistency=ON;
SET GLOBAL gtid_mode=ON;
Enter fullscreen mode Exit fullscreen mode

Configure User Permissions

The CDC user must have replication privileges.

The two essential permissions are:

  • REPLICATION SLAVE
  • REPLICATION CLIENT

After granting permissions, refresh privileges.

CREATE USER 'test'@'%' IDENTIFIED BY 'password';

GRANT SELECT,
RELOAD,
SHOW DATABASES,
REPLICATION SLAVE,
REPLICATION CLIENT
ON *.*
TO 'test';

FLUSH PRIVILEGES;
Enter fullscreen mode Exit fullscreen mode

SeaTunnel Cluster Configuration

Cluster Logging

Independent Log Files for Each Job

Configuration file:

log4j2.properties
Enter fullscreen mode Exit fullscreen mode

By default, SeaTunnel outputs logs into a centralized log file.

However, in production environments, job management is typically performed on a per-job basis. Therefore, it is recommended to configure independent log files for each job.

This approach provides several advantages:

  • Easier monitoring
  • Faster troubleshooting
  • Better operational visibility
  • More efficient job-level management

Modify the configuration as follows:

############################ log output to file #############################
# rootLogger.appenderRef.file.ref = fileAppender

# Output logs into independent files for each job
rootLogger.appenderRef.file.ref = routingAppender

############################ log output to file #############################
Enter fullscreen mode Exit fullscreen mode

Client Configuration

In production environments, SeaTunnel is typically installed under the /opt directory.

It is recommended to point SEATUNNEL_HOME to:

/opt/seatunnel
Enter fullscreen mode Exit fullscreen mode

If multiple versions are installed or the installation path differs, it is recommended to create a symbolic link so that client and server environments remain consistent.

This helps avoid classpath-related issues and missing dependency errors.

SeaTunnel's submission scripts reference client-side environment variables, including classpaths and installation directories, using absolute paths.

If client and server configurations differ, job submission may fail.

# Create symbolic link
ln -s /opt/apache-seatunnel-2.3.9 /opt/seatunnel

# Configure environment variable
export SEATUNNEL_HOME=/opt/seatunnel
Enter fullscreen mode Exit fullscreen mode

Environment Variable Configuration

For Linux servers, it is recommended to configure environment variables using the official approach by placing them under:

/etc/profile.d
Enter fullscreen mode Exit fullscreen mode

Example:

echo 'export SEATUNNEL_HOME=/opt/seatunnel' >> /etc/profile.d/seatunnel.sh

echo 'export PATH=$SEATUNNEL_HOME/bin:$PATH' >> /etc/profile.d/seatunnel.sh

source /etc/profile.d/seatunnel.sh
Enter fullscreen mode Exit fullscreen mode

Job Configuration

The following examples do not cover every available option. Instead, they focus on commonly used production configurations.

env {
  job.mode = "STREAMING"
  job.name = "DEMO"
  parallelism = 3

  checkpoint.interval = 30000
  checkpoint.timeout = 30000

  job.retry.times = 3
  job.retry.interval.seconds = 3
}
Enter fullscreen mode Exit fullscreen mode

Let's start with the env section.

Because CDC synchronization is a streaming workload, the job mode must be configured as:

STREAMING
Enter fullscreen mode Exit fullscreen mode

Next, configure the job name.

In production environments, it is recommended to use meaningful naming conventions based on database names or table names. This makes job identification and management much easier.

Parallelism

The example uses:

parallelism = 3
Enter fullscreen mode Exit fullscreen mode

The optimal parallelism value depends on:

  • Cluster size
  • Available resources
  • Database performance characteristics

Checkpoints

The checkpoint interval is configured as:

30 seconds
Enter fullscreen mode Exit fullscreen mode

If lower recovery latency is required, the interval can be reduced to 10 seconds or even lower.

Checkpoint timeout is also configured as:

30 seconds
Enter fullscreen mode Exit fullscreen mode

If checkpoint creation exceeds this threshold, the job is considered failed.

Retry Policy

The example configures:

  • Retry attempts: 3
  • Retry interval: 3 seconds

These values can be adjusted according to production requirements.

MySQL CDC Source Configuration

The MySQL CDC source configuration is one of the most important parts of the entire synchronization task.

source {
  MySQL-CDC {
    base-url = "jdbc:mysql://192.168.8.101:3306/test?serverTimezone=Asia/Shanghai"
    username = "test"
    password = "123456"

    database-names = ["test"]

    # table-names = ["test.test_001","test.test_002"]

    table-pattern = "test\\.test_.*"

    table-names-config = [
      {
        "table":"test.test_002",
        "primaryKeys":["id"]
      }
    ]

    startup.mode = "initial"

    snapshot.split.size = "8096"
    snapshot.fetch.size = "1024"

    server-id = "6500-8500"

    connect.timeout.ms = 30000
    connect.max-retries = 3
    connection.pool.size = 20

    exactly_once = false

    schema-changes.enabled = true
  }
}
Enter fullscreen mode Exit fullscreen mode

One important recommendation is to explicitly specify the timezone in the MySQL JDBC URL:

serverTimezone=Asia/Shanghai
Enter fullscreen mode Exit fullscreen mode

This helps prevent timezone-related inconsistencies when extracting DATETIME and TIMESTAMP fields.

The username and password should belong to an account with replication permissions, capable of reading binlogs and accessing all tables involved in synchronization.

In most production environments, each synchronization task is typically configured for a specific database. Therefore, it is recommended to explicitly specify the database being synchronized:

database-names = ["test"]
Enter fullscreen mode Exit fullscreen mode

Table Selection

SeaTunnel provides two approaches for selecting source tables:

Explicit Table List

table-names = [
  "test.test_001",
  "test.test_002"
]
Enter fullscreen mode Exit fullscreen mode

Regular Expression Matching

table-pattern = "test\\.test_.*"
Enter fullscreen mode Exit fullscreen mode

For large-scale initial synchronization scenarios involving many tables—or even an entire database—regular expressions are often the preferred option.

When using regular expressions, both the database name and table name must be included.

For example:

test\\.test_.*
Enter fullscreen mode Exit fullscreen mode

The escaped dot (\\.) represents the literal separator between the database name and table name.

The pattern above matches all tables in the test database whose names begin with:

test_
Enter fullscreen mode Exit fullscreen mode

This approach is especially useful for large-scale database synchronization projects.

Table-Level Custom Configuration

Additional table-specific settings can also be defined.

For example:

table-names-config = [
  {
    "table":"test.test_002",
    "primaryKeys":["id"]
  }
]
Enter fullscreen mode Exit fullscreen mode

Suppose test_002 does not have a physical primary key.

In that case, a logical primary key can be specified manually to support synchronization and downstream processing.

Startup Mode

One of the most important options is:

startup.mode = "initial"
Enter fullscreen mode Exit fullscreen mode

This is also the most common production configuration.

The initial mode performs:

  1. Historical full synchronization
  2. Continuous CDC synchronization

In other words:

Full Load + Incremental CDC

This approach is widely used when onboarding existing databases into real-time synchronization pipelines.

Snapshot Configuration

The following options control snapshot processing:

snapshot.split.size = "8096"
snapshot.fetch.size = "1024"
Enter fullscreen mode Exit fullscreen mode

The default values work well in most cases.

For larger clusters or more powerful servers, these values can be adjusted to improve throughput.

Server ID Configuration

Another critical parameter is:

server-id = "6500-8500"
Enter fullscreen mode Exit fullscreen mode

When SeaTunnel consumes MySQL binlogs, it behaves like a MySQL replica node.

MySQL replication requires every replica to have a unique server-id.

If not specified, a default value will be used.

However, the official recommendation is to configure a dedicated server-id range.

An important requirement is:

The server-id range must be larger than the configured parallelism; otherwise, synchronization tasks may fail during startup.

Connection Settings

The following parameters control connectivity:

connect.timeout.ms = 30000
connect.max-retries = 3
connection.pool.size = 20
Enter fullscreen mode Exit fullscreen mode

For large datasets or slower networks, increasing timeout values may be beneficial.

Similarly, connection pool size can be increased for high-throughput synchronization workloads.

Exactly-Once Semantics

exactly_once = false
Enter fullscreen mode Exit fullscreen mode

For most CDC analytics scenarios, strict transactional consistency is not required.

Therefore, disabling Exactly-Once semantics is often recommended because it can significantly improve synchronization performance.

If strong consistency guarantees are required by the business, this option can be enabled.

However, enabling Exactly-Once generally introduces additional overhead and may reduce throughput.

Schema Evolution

Another highly recommended feature is:

schema-changes.enabled = true
Enter fullscreen mode Exit fullscreen mode

Schema Evolution allows SeaTunnel to automatically adapt to changes in source table structures.

For example:

  • Add Column
  • Drop Column
  • Rename Column
  • Modify Column

This reduces the need to manually modify synchronization jobs whenever schema changes occur.

However, it also introduces certain considerations.

For example, if a downstream application depends on a column name that is automatically renamed, related SQL statements may fail.

Therefore, users should balance automation and downstream compatibility according to their own requirements.

According to the official documentation, Schema Evolution currently supports:

  • add column
  • drop column
  • rename column
  • modify column

Not all DDL operations are supported.

For example:

  • CREATE TABLE
  • DROP TABLE

cannot currently be captured and propagated automatically.

Despite these limitations, Schema Evolution remains an extremely valuable capability and is highly recommended for most production environments.

PostgreSQL Sink Configuration

sink {
  jdbc {
    url = "jdbc:postgresql://192.168.8.101:5432/test"
    driver = "org.postgresql.Driver"

    user = "postgres"
    password = "123456"

    generate_sink_sql = true

    database = "test"

    table = "${database_name}.${table_name}"

    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"

    data_save_mode = "APPEND_DATA"

    # enable_upsert = false
  }
}
Enter fullscreen mode Exit fullscreen mode

The Sink configuration writes synchronized data into PostgreSQL.

In addition to the connection URL, JDBC driver, username, and password, one particularly useful feature is:

generate_sink_sql = true
Enter fullscreen mode Exit fullscreen mode

When enabled, SeaTunnel automatically generates:

  • CREATE TABLE statements
  • INSERT statements
  • DELETE statements
  • UPDATE statements based on primary keys

This greatly simplifies synchronization configuration and eliminates the need for complex manual SQL development.

Understanding PostgreSQL Schema Hierarchy

When synchronizing data to PostgreSQL, it is important to understand the difference between MySQL and PostgreSQL object hierarchies.

MySQL contains:

Database → Table
Enter fullscreen mode Exit fullscreen mode

PostgreSQL contains:

Database → Schema → Table
Enter fullscreen mode Exit fullscreen mode

Therefore, if all synchronized tables should be stored in a specific PostgreSQL database, this must be configured appropriately.

It is also recommended that the PostgreSQL user have table creation permissions if automatic table creation is enabled.

Placeholder Support

SeaTunnel provides a powerful placeholder mechanism:

table = "${database_name}.${table_name}"
Enter fullscreen mode Exit fullscreen mode

This feature is especially useful when synchronizing many tables.

Instead of manually defining every target table name, placeholders automatically generate target table mappings based on source metadata.

This significantly reduces maintenance effort in large-scale synchronization projects.

Save Modes

Schema Save Mode

schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
Enter fullscreen mode Exit fullscreen mode

This option is extremely useful for whole-database synchronization.

It automatically creates target schemas and tables when they do not already exist.

As a result, users can avoid many manual setup steps.

Data Save Mode

data_save_mode = "APPEND_DATA"
Enter fullscreen mode Exit fullscreen mode

APPEND_DATA prevents existing synchronized data from being overwritten.

This makes it one of the safest and most commonly used modes in production environments.

Other save modes are also available and can be selected according to business requirements.

Upsert Configuration

enable_upsert
Enter fullscreen mode Exit fullscreen mode

If you can guarantee that source-side data will never contain duplicate primary keys, disabling upsert operations may significantly improve synchronization performance.

However, if duplicate records are possible, it is recommended to keep Upsert enabled.

SeaTunnel can then perform primary-key-based updates automatically.

Refer to the official documentation for detailed parameter descriptions and supported behaviors.

Job Submission and Monitoring

After completing the configuration file, the synchronization job can be submitted using SeaTunnel's command-line tools.

Example:

./bin/start-seatunnel.sh --config /path/to/config.yaml --async
Enter fullscreen mode Exit fullscreen mode

Parameter description:

  • config — Specifies the configuration file path.
  • async — Submits the job asynchronously. After submission, the command-line process exits immediately while the job continues running in the background.

After submission, the job can be monitored through the SeaTunnel UI.

Starting from version 2.3.9, SeaTunnel provides an intuitive web interface that allows users to view:

  • Job status
  • Execution logs
  • Throughput statistics
  • Data processing metrics

Synchronization Demonstration

In this demo, two tables were created:

test_001
test_002
Enter fullscreen mode Exit fullscreen mode

After inserting sample records into MySQL, SeaTunnel successfully synchronized the data into PostgreSQL.

The demonstration also covers:

  • INSERT operations
  • DELETE operations
  • UPDATE operations
  • Schema changes

SeaTunnel successfully captures and synchronizes all these changes to PostgreSQL in real time.

Key Takeaways

Automatic Schema Synchronization

SeaTunnel supports automatic schema synchronization.

When the schema of a source MySQL table changes, the corresponding PostgreSQL table structure can be updated automatically as well.

Data Consistency

SeaTunnel ensures data consistency throughout the synchronization process.

All INSERT, DELETE, and UPDATE operations are accurately replicated to the target database, providing a reliable foundation for real-time analytics and data integration scenarios.

Top comments (0)