This article provides a detailed walkthrough on using Apache SeaTunnel 2.3.9 to perform full data synchronization from MySQL CDC to PostgreSQL.
For the complete demonstration, please refer to the video:
Demo: Synchronizing Data from MySQL CDC to PostgreSQL with Apache SeaTunnel
Without further ado, let's dive into the MySQL-to-PostgreSQL synchronization scenario.
Version Requirements
- MySQL → MySQL 8.3
- PostgreSQL → PostgreSQL 13.2
- Apache SeaTunnel → Apache SeaTunnel 2.3.9
All configuration files used in this article are available by replying with the keyword "Demo 01" on our WeChat official account.
Prerequisites
Verify Version Information
-- Check version information
SELECT VERSION();
Enable Replication
-- Check replication-related configurations
SHOW VARIABLES
WHERE variable_name IN (
'log_bin',
'binlog_format',
'binlog_row_image',
'gtid_mode',
'enforce_gtid_consistency'
);
MySQL CDC synchronization relies on reading MySQL binlog files. SeaTunnel cluster nodes act as replication replicas within the MySQL replication architecture.
Therefore, before configuring CDC synchronization, you must verify that:
- Binary logging (binlog) is enabled.
- Replication mode is enabled.
Note: For MySQL 8.0 and later versions, binlog is enabled by default. However, replication-related settings still need to be configured manually.
-- Enable replication (execute sequentially)
-- SET GLOBAL gtid_mode=OFF;
-- SET GLOBAL enforce_gtid_consistency=OFF;
SET GLOBAL gtid_mode=OFF_PERMISSIVE;
SET GLOBAL gtid_mode=ON_PERMISSIVE;
SET GLOBAL enforce_gtid_consistency=ON;
SET GLOBAL gtid_mode=ON;
Configure User Permissions
The CDC user must have replication privileges.
The two essential permissions are:
- REPLICATION SLAVE
- REPLICATION CLIENT
After granting permissions, refresh privileges.
CREATE USER 'test'@'%' IDENTIFIED BY 'password';
GRANT SELECT,
RELOAD,
SHOW DATABASES,
REPLICATION SLAVE,
REPLICATION CLIENT
ON *.*
TO 'test';
FLUSH PRIVILEGES;
SeaTunnel Cluster Configuration
Cluster Logging
Independent Log Files for Each Job
Configuration file:
log4j2.properties
By default, SeaTunnel outputs logs into a centralized log file.
However, in production environments, job management is typically performed on a per-job basis. Therefore, it is recommended to configure independent log files for each job.
This approach provides several advantages:
- Easier monitoring
- Faster troubleshooting
- Better operational visibility
- More efficient job-level management
Modify the configuration as follows:
############################ log output to file #############################
# rootLogger.appenderRef.file.ref = fileAppender
# Output logs into independent files for each job
rootLogger.appenderRef.file.ref = routingAppender
############################ log output to file #############################
Client Configuration
In production environments, SeaTunnel is typically installed under the /opt directory.
It is recommended to point SEATUNNEL_HOME to:
/opt/seatunnel
If multiple versions are installed or the installation path differs, it is recommended to create a symbolic link so that client and server environments remain consistent.
This helps avoid classpath-related issues and missing dependency errors.
SeaTunnel's submission scripts reference client-side environment variables, including classpaths and installation directories, using absolute paths.
If client and server configurations differ, job submission may fail.
# Create symbolic link
ln -s /opt/apache-seatunnel-2.3.9 /opt/seatunnel
# Configure environment variable
export SEATUNNEL_HOME=/opt/seatunnel
Environment Variable Configuration
For Linux servers, it is recommended to configure environment variables using the official approach by placing them under:
/etc/profile.d
Example:
echo 'export SEATUNNEL_HOME=/opt/seatunnel' >> /etc/profile.d/seatunnel.sh
echo 'export PATH=$SEATUNNEL_HOME/bin:$PATH' >> /etc/profile.d/seatunnel.sh
source /etc/profile.d/seatunnel.sh
Job Configuration
The following examples do not cover every available option. Instead, they focus on commonly used production configurations.
env {
job.mode = "STREAMING"
job.name = "DEMO"
parallelism = 3
checkpoint.interval = 30000
checkpoint.timeout = 30000
job.retry.times = 3
job.retry.interval.seconds = 3
}
Let's start with the env section.
Because CDC synchronization is a streaming workload, the job mode must be configured as:
STREAMING
Next, configure the job name.
In production environments, it is recommended to use meaningful naming conventions based on database names or table names. This makes job identification and management much easier.
Parallelism
The example uses:
parallelism = 3
The optimal parallelism value depends on:
- Cluster size
- Available resources
- Database performance characteristics
Checkpoints
The checkpoint interval is configured as:
30 seconds
If lower recovery latency is required, the interval can be reduced to 10 seconds or even lower.
Checkpoint timeout is also configured as:
30 seconds
If checkpoint creation exceeds this threshold, the job is considered failed.
Retry Policy
The example configures:
- Retry attempts: 3
- Retry interval: 3 seconds
These values can be adjusted according to production requirements.
MySQL CDC Source Configuration
The MySQL CDC source configuration is one of the most important parts of the entire synchronization task.
source {
MySQL-CDC {
base-url = "jdbc:mysql://192.168.8.101:3306/test?serverTimezone=Asia/Shanghai"
username = "test"
password = "123456"
database-names = ["test"]
# table-names = ["test.test_001","test.test_002"]
table-pattern = "test\\.test_.*"
table-names-config = [
{
"table":"test.test_002",
"primaryKeys":["id"]
}
]
startup.mode = "initial"
snapshot.split.size = "8096"
snapshot.fetch.size = "1024"
server-id = "6500-8500"
connect.timeout.ms = 30000
connect.max-retries = 3
connection.pool.size = 20
exactly_once = false
schema-changes.enabled = true
}
}
One important recommendation is to explicitly specify the timezone in the MySQL JDBC URL:
serverTimezone=Asia/Shanghai
This helps prevent timezone-related inconsistencies when extracting DATETIME and TIMESTAMP fields.
The username and password should belong to an account with replication permissions, capable of reading binlogs and accessing all tables involved in synchronization.
In most production environments, each synchronization task is typically configured for a specific database. Therefore, it is recommended to explicitly specify the database being synchronized:
database-names = ["test"]
Table Selection
SeaTunnel provides two approaches for selecting source tables:
Explicit Table List
table-names = [
"test.test_001",
"test.test_002"
]
Regular Expression Matching
table-pattern = "test\\.test_.*"
For large-scale initial synchronization scenarios involving many tables—or even an entire database—regular expressions are often the preferred option.
When using regular expressions, both the database name and table name must be included.
For example:
test\\.test_.*
The escaped dot (\\.) represents the literal separator between the database name and table name.
The pattern above matches all tables in the test database whose names begin with:
test_
This approach is especially useful for large-scale database synchronization projects.
Table-Level Custom Configuration
Additional table-specific settings can also be defined.
For example:
table-names-config = [
{
"table":"test.test_002",
"primaryKeys":["id"]
}
]
Suppose test_002 does not have a physical primary key.
In that case, a logical primary key can be specified manually to support synchronization and downstream processing.
Startup Mode
One of the most important options is:
startup.mode = "initial"
This is also the most common production configuration.
The initial mode performs:
- Historical full synchronization
- Continuous CDC synchronization
In other words:
Full Load + Incremental CDC
This approach is widely used when onboarding existing databases into real-time synchronization pipelines.
Snapshot Configuration
The following options control snapshot processing:
snapshot.split.size = "8096"
snapshot.fetch.size = "1024"
The default values work well in most cases.
For larger clusters or more powerful servers, these values can be adjusted to improve throughput.
Server ID Configuration
Another critical parameter is:
server-id = "6500-8500"
When SeaTunnel consumes MySQL binlogs, it behaves like a MySQL replica node.
MySQL replication requires every replica to have a unique server-id.
If not specified, a default value will be used.
However, the official recommendation is to configure a dedicated server-id range.
An important requirement is:
The server-id range must be larger than the configured parallelism; otherwise, synchronization tasks may fail during startup.
Connection Settings
The following parameters control connectivity:
connect.timeout.ms = 30000
connect.max-retries = 3
connection.pool.size = 20
For large datasets or slower networks, increasing timeout values may be beneficial.
Similarly, connection pool size can be increased for high-throughput synchronization workloads.
Exactly-Once Semantics
exactly_once = false
For most CDC analytics scenarios, strict transactional consistency is not required.
Therefore, disabling Exactly-Once semantics is often recommended because it can significantly improve synchronization performance.
If strong consistency guarantees are required by the business, this option can be enabled.
However, enabling Exactly-Once generally introduces additional overhead and may reduce throughput.
Schema Evolution
Another highly recommended feature is:
schema-changes.enabled = true
Schema Evolution allows SeaTunnel to automatically adapt to changes in source table structures.
For example:
- Add Column
- Drop Column
- Rename Column
- Modify Column
This reduces the need to manually modify synchronization jobs whenever schema changes occur.
However, it also introduces certain considerations.
For example, if a downstream application depends on a column name that is automatically renamed, related SQL statements may fail.
Therefore, users should balance automation and downstream compatibility according to their own requirements.
According to the official documentation, Schema Evolution currently supports:
- add column
- drop column
- rename column
- modify column
Not all DDL operations are supported.
For example:
- CREATE TABLE
- DROP TABLE
cannot currently be captured and propagated automatically.
Despite these limitations, Schema Evolution remains an extremely valuable capability and is highly recommended for most production environments.
PostgreSQL Sink Configuration
sink {
jdbc {
url = "jdbc:postgresql://192.168.8.101:5432/test"
driver = "org.postgresql.Driver"
user = "postgres"
password = "123456"
generate_sink_sql = true
database = "test"
table = "${database_name}.${table_name}"
schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
data_save_mode = "APPEND_DATA"
# enable_upsert = false
}
}
The Sink configuration writes synchronized data into PostgreSQL.
In addition to the connection URL, JDBC driver, username, and password, one particularly useful feature is:
generate_sink_sql = true
When enabled, SeaTunnel automatically generates:
- CREATE TABLE statements
- INSERT statements
- DELETE statements
- UPDATE statements based on primary keys
This greatly simplifies synchronization configuration and eliminates the need for complex manual SQL development.
Understanding PostgreSQL Schema Hierarchy
When synchronizing data to PostgreSQL, it is important to understand the difference between MySQL and PostgreSQL object hierarchies.
MySQL contains:
Database → Table
PostgreSQL contains:
Database → Schema → Table
Therefore, if all synchronized tables should be stored in a specific PostgreSQL database, this must be configured appropriately.
It is also recommended that the PostgreSQL user have table creation permissions if automatic table creation is enabled.
Placeholder Support
SeaTunnel provides a powerful placeholder mechanism:
table = "${database_name}.${table_name}"
This feature is especially useful when synchronizing many tables.
Instead of manually defining every target table name, placeholders automatically generate target table mappings based on source metadata.
This significantly reduces maintenance effort in large-scale synchronization projects.
Save Modes
Schema Save Mode
schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
This option is extremely useful for whole-database synchronization.
It automatically creates target schemas and tables when they do not already exist.
As a result, users can avoid many manual setup steps.
Data Save Mode
data_save_mode = "APPEND_DATA"
APPEND_DATA prevents existing synchronized data from being overwritten.
This makes it one of the safest and most commonly used modes in production environments.
Other save modes are also available and can be selected according to business requirements.
Upsert Configuration
enable_upsert
If you can guarantee that source-side data will never contain duplicate primary keys, disabling upsert operations may significantly improve synchronization performance.
However, if duplicate records are possible, it is recommended to keep Upsert enabled.
SeaTunnel can then perform primary-key-based updates automatically.
Refer to the official documentation for detailed parameter descriptions and supported behaviors.
Job Submission and Monitoring
After completing the configuration file, the synchronization job can be submitted using SeaTunnel's command-line tools.
Example:
./bin/start-seatunnel.sh --config /path/to/config.yaml --async
Parameter description:
-
config— Specifies the configuration file path. -
async— Submits the job asynchronously. After submission, the command-line process exits immediately while the job continues running in the background.
After submission, the job can be monitored through the SeaTunnel UI.
Starting from version 2.3.9, SeaTunnel provides an intuitive web interface that allows users to view:
- Job status
- Execution logs
- Throughput statistics
- Data processing metrics
Synchronization Demonstration
In this demo, two tables were created:
test_001
test_002
After inserting sample records into MySQL, SeaTunnel successfully synchronized the data into PostgreSQL.
The demonstration also covers:
- INSERT operations
- DELETE operations
- UPDATE operations
- Schema changes
SeaTunnel successfully captures and synchronizes all these changes to PostgreSQL in real time.
Key Takeaways
Automatic Schema Synchronization
SeaTunnel supports automatic schema synchronization.
When the schema of a source MySQL table changes, the corresponding PostgreSQL table structure can be updated automatically as well.
Data Consistency
SeaTunnel ensures data consistency throughout the synchronization process.
All INSERT, DELETE, and UPDATE operations are accurately replicated to the target database, providing a reliable foundation for real-time analytics and data integration scenarios.
Top comments (0)