Apache SeaTunnel

Posted on Jun 11

Demo: Full Data Synchronization from MySQL CDC to PostgreSQL with Apache SeaTunnel

#mysql #postgres #apacheseatunnel #datascience

This article provides a detailed walkthrough on using Apache SeaTunnel 2.3.9 to perform full data synchronization from MySQL CDC to PostgreSQL.

For the complete demonstration, please refer to the video:

Demo: Synchronizing Data from MySQL CDC to PostgreSQL with Apache SeaTunnel

Without further ado, let's dive into the MySQL-to-PostgreSQL synchronization scenario.

Version Requirements

MySQL → MySQL 8.3
PostgreSQL → PostgreSQL 13.2
Apache SeaTunnel → Apache SeaTunnel 2.3.9

All configuration files used in this article are available by replying with the keyword "Demo 01" on our WeChat official account.

Prerequisites

Verify Version Information

-- Check version information
SELECT VERSION();

Enable Replication

-- Check replication-related configurations
SHOW VARIABLES
WHERE variable_name IN (
'log_bin',
'binlog_format',
'binlog_row_image',
'gtid_mode',
'enforce_gtid_consistency'
);

MySQL CDC synchronization relies on reading MySQL binlog files. SeaTunnel cluster nodes act as replication replicas within the MySQL replication architecture.

Therefore, before configuring CDC synchronization, you must verify that:

Binary logging (binlog) is enabled.
Replication mode is enabled.

Note: For MySQL 8.0 and later versions, binlog is enabled by default. However, replication-related settings still need to be configured manually.

-- Enable replication (execute sequentially)
-- SET GLOBAL gtid_mode=OFF;
-- SET GLOBAL enforce_gtid_consistency=OFF;

SET GLOBAL gtid_mode=OFF_PERMISSIVE;
SET GLOBAL gtid_mode=ON_PERMISSIVE;
SET GLOBAL enforce_gtid_consistency=ON;
SET GLOBAL gtid_mode=ON;

Configure User Permissions

The CDC user must have replication privileges.

The two essential permissions are:

REPLICATION SLAVE
REPLICATION CLIENT

After granting permissions, refresh privileges.

CREATE USER 'test'@'%' IDENTIFIED BY 'password';

GRANT SELECT,
RELOAD,
SHOW DATABASES,
REPLICATION SLAVE,
REPLICATION CLIENT
ON *.*
TO 'test';

FLUSH PRIVILEGES;

SeaTunnel Cluster Configuration

Cluster Logging

Independent Log Files for Each Job

Configuration file:

log4j2.properties

By default, SeaTunnel outputs logs into a centralized log file.

However, in production environments, job management is typically performed on a per-job basis. Therefore, it is recommended to configure independent log files for each job.

This approach provides several advantages:

Easier monitoring
Faster troubleshooting
Better operational visibility
More efficient job-level management

Modify the configuration as follows:

############################ log output to file #############################
# rootLogger.appenderRef.file.ref = fileAppender

# Output logs into independent files for each job
rootLogger.appenderRef.file.ref = routingAppender

############################ log output to file #############################

Client Configuration

In production environments, SeaTunnel is typically installed under the /opt directory.

It is recommended to point SEATUNNEL_HOME to:

/opt/seatunnel

If multiple versions are installed or the installation path differs, it is recommended to create a symbolic link so that client and server environments remain consistent.

This helps avoid classpath-related issues and missing dependency errors.

SeaTunnel's submission scripts reference client-side environment variables, including classpaths and installation directories, using absolute paths.

If client and server configurations differ, job submission may fail.

# Create symbolic link
ln -s /opt/apache-seatunnel-2.3.9 /opt/seatunnel

# Configure environment variable
export SEATUNNEL_HOME=/opt/seatunnel

Environment Variable Configuration

For Linux servers, it is recommended to configure environment variables using the official approach by placing them under:

/etc/profile.d

Example:

echo 'export SEATUNNEL_HOME=/opt/seatunnel' >> /etc/profile.d/seatunnel.sh

echo 'export PATH=$SEATUNNEL_HOME/bin:$PATH' >> /etc/profile.d/seatunnel.sh

source /etc/profile.d/seatunnel.sh

Job Configuration

The following examples do not cover every available option. Instead, they focus on commonly used production configurations.

env {
  job.mode = "STREAMING"
  job.name = "DEMO"
  parallelism = 3

  checkpoint.interval = 30000
  checkpoint.timeout = 30000

  job.retry.times = 3
  job.retry.interval.seconds = 3
}

Let's start with the env section.

Because CDC synchronization is a streaming workload, the job mode must be configured as:

STREAMING

Next, configure the job name.

In production environments, it is recommended to use meaningful naming conventions based on database names or table names. This makes job identification and management much easier.

Parallelism

The example uses:

parallelism = 3

The optimal parallelism value depends on:

Cluster size
Available resources
Database performance characteristics

Checkpoints

The checkpoint interval is configured as:

30 seconds

If lower recovery latency is required, the interval can be reduced to 10 seconds or even lower.

Checkpoint timeout is also configured as:

30 seconds

If checkpoint creation exceeds this threshold, the job is considered failed.

Retry Policy

The example configures:

Retry attempts: 3
Retry interval: 3 seconds

These values can be adjusted according to production requirements.

MySQL CDC Source Configuration

The MySQL CDC source configuration is one of the most important parts of the entire synchronization task.

source {
  MySQL-CDC {
    base-url = "jdbc:mysql://192.168.8.101:3306/test?serverTimezone=Asia/Shanghai"
    username = "test"
    password = "123456"

    database-names = ["test"]

    # table-names = ["test.test_001","test.test_002"]

    table-pattern = "test\\.test_.*"

    table-names-config = [
      {
        "table":"test.test_002",
        "primaryKeys":["id"]
      }
    ]

    startup.mode = "initial"

    snapshot.split.size = "8096"
    snapshot.fetch.size = "1024"

    server-id = "6500-8500"

    connect.timeout.ms = 30000
    connect.max-retries = 3
    connection.pool.size = 20

    exactly_once = false

    schema-changes.enabled = true
  }
}

One important recommendation is to explicitly specify the timezone in the MySQL JDBC URL:

serverTimezone=Asia/Shanghai

This helps prevent timezone-related inconsistencies when extracting DATETIME and TIMESTAMP fields.

The username and password should belong to an account with replication permissions, capable of reading binlogs and accessing all tables involved in synchronization.

In most production environments, each synchronization task is typically configured for a specific database. Therefore, it is recommended to explicitly specify the database being synchronized:

database-names = ["test"]

Table Selection

SeaTunnel provides two approaches for selecting source tables:

Explicit Table List

table-names = [
  "test.test_001",
  "test.test_002"
]

Regular Expression Matching

table-pattern = "test\\.test_.*"

For large-scale initial synchronization scenarios involving many tables—or even an entire database—regular expressions are often the preferred option.

When using regular expressions, both the database name and table name must be included.

For example:

test\\.test_.*

The escaped dot (\\.) represents the literal separator between the database name and table name.

The pattern above matches all tables in the test database whose names begin with:

test_

This approach is especially useful for large-scale database synchronization projects.

Table-Level Custom Configuration

Additional table-specific settings can also be defined.

For example:

table-names-config = [
  {
    "table":"test.test_002",
    "primaryKeys":["id"]
  }
]

Suppose test_002 does not have a physical primary key.

In that case, a logical primary key can be specified manually to support synchronization and downstream processing.

Startup Mode

One of the most important options is:

startup.mode = "initial"

This is also the most common production configuration.

The initial mode performs:

Historical full synchronization
Continuous CDC synchronization

In other words:

Full Load + Incremental CDC

This approach is widely used when onboarding existing databases into real-time synchronization pipelines.

Snapshot Configuration

The following options control snapshot processing:

snapshot.split.size = "8096"
snapshot.fetch.size = "1024"

The default values work well in most cases.

For larger clusters or more powerful servers, these values can be adjusted to improve throughput.

Server ID Configuration

Another critical parameter is:

server-id = "6500-8500"

When SeaTunnel consumes MySQL binlogs, it behaves like a MySQL replica node.

MySQL replication requires every replica to have a unique server-id.

If not specified, a default value will be used.

However, the official recommendation is to configure a dedicated server-id range.

An important requirement is:

The server-id range must be larger than the configured parallelism; otherwise, synchronization tasks may fail during startup.

Connection Settings

The following parameters control connectivity:

connect.timeout.ms = 30000
connect.max-retries = 3
connection.pool.size = 20

For large datasets or slower networks, increasing timeout values may be beneficial.

Similarly, connection pool size can be increased for high-throughput synchronization workloads.

Exactly-Once Semantics

exactly_once = false

For most CDC analytics scenarios, strict transactional consistency is not required.

Therefore, disabling Exactly-Once semantics is often recommended because it can significantly improve synchronization performance.

If strong consistency guarantees are required by the business, this option can be enabled.

However, enabling Exactly-Once generally introduces additional overhead and may reduce throughput.

Schema Evolution

Another highly recommended feature is:

schema-changes.enabled = true

Schema Evolution allows SeaTunnel to automatically adapt to changes in source table structures.

For example:

Add Column
Drop Column
Rename Column
Modify Column

This reduces the need to manually modify synchronization jobs whenever schema changes occur.

However, it also introduces certain considerations.

For example, if a downstream application depends on a column name that is automatically renamed, related SQL statements may fail.

Therefore, users should balance automation and downstream compatibility according to their own requirements.

According to the official documentation, Schema Evolution currently supports:

add column
drop column
rename column
modify column

Not all DDL operations are supported.

For example:

CREATE TABLE
DROP TABLE

cannot currently be captured and propagated automatically.

Despite these limitations, Schema Evolution remains an extremely valuable capability and is highly recommended for most production environments.

PostgreSQL Sink Configuration

sink {
  jdbc {
    url = "jdbc:postgresql://192.168.8.101:5432/test"
    driver = "org.postgresql.Driver"

    user = "postgres"
    password = "123456"

    generate_sink_sql = true

    database = "test"

    table = "${database_name}.${table_name}"

    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"

    data_save_mode = "APPEND_DATA"

    # enable_upsert = false
  }
}

The Sink configuration writes synchronized data into PostgreSQL.

In addition to the connection URL, JDBC driver, username, and password, one particularly useful feature is:

generate_sink_sql = true

When enabled, SeaTunnel automatically generates:

CREATE TABLE statements
INSERT statements
DELETE statements
UPDATE statements based on primary keys

This greatly simplifies synchronization configuration and eliminates the need for complex manual SQL development.

Understanding PostgreSQL Schema Hierarchy

When synchronizing data to PostgreSQL, it is important to understand the difference between MySQL and PostgreSQL object hierarchies.

MySQL contains:

Database → Table

PostgreSQL contains:

Database → Schema → Table

Therefore, if all synchronized tables should be stored in a specific PostgreSQL database, this must be configured appropriately.

It is also recommended that the PostgreSQL user have table creation permissions if automatic table creation is enabled.

Placeholder Support

SeaTunnel provides a powerful placeholder mechanism:

table = "${database_name}.${table_name}"

This feature is especially useful when synchronizing many tables.

Instead of manually defining every target table name, placeholders automatically generate target table mappings based on source metadata.

This significantly reduces maintenance effort in large-scale synchronization projects.

Save Modes

Schema Save Mode

schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"

This option is extremely useful for whole-database synchronization.

It automatically creates target schemas and tables when they do not already exist.

As a result, users can avoid many manual setup steps.

Data Save Mode

data_save_mode = "APPEND_DATA"

APPEND_DATA prevents existing synchronized data from being overwritten.

This makes it one of the safest and most commonly used modes in production environments.

Other save modes are also available and can be selected according to business requirements.

Upsert Configuration

enable_upsert

If you can guarantee that source-side data will never contain duplicate primary keys, disabling upsert operations may significantly improve synchronization performance.

However, if duplicate records are possible, it is recommended to keep Upsert enabled.

SeaTunnel can then perform primary-key-based updates automatically.

Refer to the official documentation for detailed parameter descriptions and supported behaviors.

Job Submission and Monitoring

After completing the configuration file, the synchronization job can be submitted using SeaTunnel's command-line tools.

Example:

./bin/start-seatunnel.sh --config /path/to/config.yaml --async

Parameter description:

config — Specifies the configuration file path.
async — Submits the job asynchronously. After submission, the command-line process exits immediately while the job continues running in the background.

After submission, the job can be monitored through the SeaTunnel UI.

Starting from version 2.3.9, SeaTunnel provides an intuitive web interface that allows users to view:

Job status
Execution logs
Throughput statistics
Data processing metrics

Synchronization Demonstration

In this demo, two tables were created:

test_001
test_002

After inserting sample records into MySQL, SeaTunnel successfully synchronized the data into PostgreSQL.

The demonstration also covers:

INSERT operations
DELETE operations
UPDATE operations
Schema changes

SeaTunnel successfully captures and synchronizes all these changes to PostgreSQL in real time.

Key Takeaways

Automatic Schema Synchronization

SeaTunnel supports automatic schema synchronization.

When the schema of a source MySQL table changes, the corresponding PostgreSQL table structure can be updated automatically as well.

Data Consistency

SeaTunnel ensures data consistency throughout the synchronization process.

All INSERT, DELETE, and UPDATE operations are accurately replicated to the target database, providing a reliable foundation for real-time analytics and data integration scenarios.

DEV Community

Demo: Full Data Synchronization from MySQL CDC to PostgreSQL with Apache SeaTunnel

Version Requirements

Prerequisites

Verify Version Information

Enable Replication

Configure User Permissions

SeaTunnel Cluster Configuration

Cluster Logging

Independent Log Files for Each Job

Client Configuration

Environment Variable Configuration

Job Configuration

Parallelism

Checkpoints

Retry Policy

MySQL CDC Source Configuration

Table Selection

Explicit Table List

Regular Expression Matching

Table-Level Custom Configuration

Startup Mode

Snapshot Configuration

Server ID Configuration

Connection Settings

Exactly-Once Semantics

Schema Evolution

PostgreSQL Sink Configuration

Understanding PostgreSQL Schema Hierarchy

Placeholder Support

Save Modes

Schema Save Mode

Data Save Mode

Upsert Configuration

Job Submission and Monitoring

Synchronization Demonstration

Key Takeaways

Automatic Schema Synchronization

Data Consistency

Top comments (0)