DEV Community: Apache SeaTunnel

Still maintaining hundreds of Python ETL scripts? 🤯 Build pipelines, not infrastructure. Let Apache SeaTunnel handle the runtime while you focus on data. 🚀

Apache SeaTunnel — Fri, 17 Jul 2026 02:29:53 +0000

Apache SeaTunnel

Jul 17

From Python Script Hell to a Modern Data Integration Framework

11 min read

From Python Script Hell to a Modern Data Integration Framework

Apache SeaTunnel — Fri, 17 Jul 2026 02:29:25 +0000

Every Data Team Eventually Ends Up with a Collection of Python Scripts

Almost every enterprise data platform follows a similar evolution.

At the beginning of a project, data ingestion requirements are usually straightforward. Business systems need to synchronize MySQL data into a data warehouse. Marketing teams want to periodically retrieve campaign data from third-party REST APIs. Logging systems consume real-time messages from Kafka before writing them into ClickHouse or Elasticsearch. For these scenarios, Python naturally becomes the preferred language for most data engineers because of its rich ecosystem and low development cost. With just a few dozen or a few hundred lines of code, a complete data synchronization task can be implemented.

At this stage, the approach works perfectly well. Development is fast, deployment is simple, and a new requirement usually means adding another Python file that can be delivered to production within a short time.

The real challenge emerges as the business continues to grow.

As more data sources are introduced, organizations gradually accumulate dozens, hundreds, or even thousands of synchronization jobs. Git repositories become filled with Python scripts written by different developers, following different coding styles, and using different execution models. When another synchronization task is required, the easiest solution is rarely to redesign the architecture. Instead, developers simply copy an existing script, modify the database connection, update the SQL statements and destination table, and deploy yet another program.

Over time, the data platform gradually evolves into nothing more than a large collection of scripts.

Many engineering teams refer to this phenomenon as "Python Script Hell." However, the problem is not Python itself. The real issue is that organizations have unintentionally treated Python scripts as their data integration platform.

What Enterprises Keep Rebuilding Is Actually a Runtime Framework

If you take a closer look at these Python scripts, you'll find that only a small portion of the code is truly business-specific.

A typical data synchronization task usually consists of just three steps: reading data, transforming data, and writing data. Whether the task synchronizes orders, customer information, or application logs, the business logic usually differs only in the data source, transformation rules, and destination system.

However, a production-ready Python script involves much more than these three steps. Developers must establish database connections, manage thread pools, handle network failures and retry logic, persist synchronization checkpoints, generate logs and monitoring metrics, prevent duplicate writes, support task recovery, and properly release resources throughout the execution lifecycle.

In other words, while organizations appear to be developing new data synchronization jobs every day, what they are actually rebuilding repeatedly is not business logic, but an entire set of runtime infrastructure.

More importantly, this infrastructure is almost identical across different projects.

Connection management, thread scheduling, retry mechanisms, checkpointing, state recovery, logging, monitoring, and write consistency are implemented repeatedly in one project after another. These capabilities should be provided by a common platform, yet they are instead embedded in individual Python scripts, resulting in duplicated code and increasing maintenance costs.

Therefore, the real challenge is not reducing the use of Python. It is stopping the repeated implementation of runtime capabilities that should belong to a shared framework.

This is precisely the problem that Apache SeaTunnel is designed to solve.

Apache SeaTunnel: Replacing Repetitive Frameworks, Not Python

When people first encounter Apache SeaTunnel, they often think of it as just another ETL tool. They assume that it simply replaces Python code with configuration files.

However, that is only the surface.

What SeaTunnel fundamentally changes is not the programming language—it redefines the boundary between business logic and runtime capabilities.

In a traditional Python-based approach, a script is responsible not only for describing how data flows, but also for handling every aspect of execution, including connection management, task scheduling, exception recovery, state management, and many other runtime concerns. Every script effectively becomes a miniature data integration platform with its own lifecycle.

SeaTunnel takes a completely different approach.

Developers are responsible only for defining the data pipeline, while the runtime is responsible for executing that pipeline reliably, efficiently, and at scale.

As a result, a data synchronization task only needs to answer three questions:

Where does the data come from? (Source)
How should the data be processed? (Transform)
Where should the data be written? (Sink)

Everything else that is related to execution is handled by the unified runtime.

This is the fundamental difference between SeaTunnel and traditional Python scripts: developers build pipelines, while the runtime handles everything else.

Source, Transform, and Sink: Standardizing More Than Configuration

Many articles describe Source, Transform, and Sink simply as configuration concepts. In reality, these three components are much more than that—they form the core abstraction of the entire SeaTunnel runtime architecture.

Traditionally, different data synchronization tasks are implemented in completely different ways. Synchronizing data from MySQL requires database access logic, consuming Kafka messages requires managing consumers, and calling REST APIs involves authentication, pagination, and rate limiting. Although all of these tasks ultimately perform the same operation—reading data—each data source requires its own implementation.

SeaTunnel does not attempt to standardize the underlying systems themselves. Instead, it standardizes the data flow.

Whether the data comes from MySQL, Oracle, Kafka, MongoDB, Amazon S3, or a REST API, it is represented as a Source once it enters a SeaTunnel pipeline. Whether the data needs SQL processing, field mapping, type conversion, or data cleansing, those operations are handled uniformly through Transform. Finally, regardless of whether the destination is ClickHouse, StarRocks, Iceberg, Kafka, or Elasticsearch, all writes are managed through Sink.

The greatest value of this abstraction is that it shifts the developer's focus back to the data itself instead of the implementation details.

For an enterprise, creating a new synchronization task no longer means copying an existing Python project. It simply means defining another data pipeline. While business logic continues to evolve, the development model remains consistent, providing the foundation for managing data integration at scale.

Connector Framework: Turning Connectivity into a Platform Capability

In a traditional Python project, every new data source usually introduces another set of connection logic.

Connecting to MySQL requires maintaining database drivers and connection pools. Calling REST APIs requires handling authentication tokens, token refresh, and rate limiting. Consuming Kafka messages requires managing consumers, partitions, and offsets. As more data sources are introduced, these implementations become scattered across different scripts. Although they solve essentially the same problem, they are difficult to reuse in practice.

SeaTunnel encapsulates these capabilities within a unified Connector Framework.

A connector is responsible not only for establishing connections, but also for reading data, adapting communication protocols, parsing schemas, and writing data into target systems. All connectors follow the same lifecycle and interface specifications. As a result, developers no longer need to design different connection mechanisms for different systems—they simply select the appropriate connector and provide the necessary configuration.

More importantly, connector standardization makes the platform continuously extensible.

When an enterprise needs to integrate another database, cloud service, or storage system, it no longer needs to create another standalone Python project. Instead, it extends the platform by implementing a new connector that operates within the existing framework.

The runtime remains the same. Only the data access capability is extended.

This is the fundamental distinction between a framework and a collection of scripts.

SeaTunnel Runtime: What Gets Reused Is the Runtime, Not the Code

If Source, Transform, and Sink standardize the data flow, then the Runtime is what transforms SeaTunnel from an ETL tool into a true Data Integration Framework.

When writing Python scripts, developers naturally focus on how data is read and written. However, once a synchronization task enters production, reading and writing data is no longer the most challenging part.

The real challenge is ensuring that the pipeline continues to run reliably over time.

A production data synchronization job must address a wide range of runtime concerns. Tasks need to be partitioned to improve throughput, multiple jobs need to be scheduled concurrently, failures must be detected and recovered automatically, execution state has to be persisted, duplicate writes must be prevented, and the entire execution process must be observable.

These problems have very little to do with business logic, yet they determine whether a data platform can operate reliably in production.

In the traditional model, every Python script implements these capabilities independently, leaving each developer responsible for building and maintaining their own runtime infrastructure.

SeaTunnel takes a different approach by consolidating these responsibilities into a unified runtime.

As a result, developers no longer need to build a separate execution environment for every synchronization task. Instead, different data pipelines run on the same runtime, sharing a common execution model and platform capabilities.

In other words, what is truly reused is not application code, but the runtime itself.

Concurrency Scheduling: From Managing Thread Pools to Configuring Parallelism

As data volumes continue to grow, most Python-based projects eventually encounter the same challenge: a single-threaded execution model is no longer sufficient to meet throughput requirements.

To address this, some teams introduce thread pools, others adopt the multiprocessing module, while some experiment with asyncio. Over time, different projects evolve different concurrency models. As the number of scripts increases, thread management, resource contention, and troubleshooting become increasingly difficult.

The fundamental problem is that concurrency is a runtime concern, not a business concern.

From a developer's perspective, the real question is "How quickly should this task complete?", not "How many threads should be created?", "How should they be scheduled?", or "How should resources be allocated?"

SeaTunnel Runtime delegates task partitioning, resource scheduling, and concurrent execution to the execution engine. Developers only need to configure an appropriate level of Parallelism based on workload requirements. The Runtime is responsible for partitioning the pipeline, allocating resources, scheduling execution, and managing the entire task lifecycle, without requiring developers to maintain thread pools or asynchronous frameworks.

This design significantly reduces development complexity while providing a consistent execution model across the entire platform. When an enterprise needs to improve synchronization performance, it adjusts platform-level configurations rather than modifying hundreds of individual Python scripts.

Checkpoint and State: Why State Management Belongs in the Framework

Compared with an initial full synchronization, enterprises are far more concerned with how a task recovers after a failure.

Consider a synchronization job that has already processed tens of millions of records before unexpectedly terminating because of a network issue or node failure. Restarting the entire job from scratch is both inefficient and likely to produce duplicate data. On the other hand, if each developer maintains synchronization progress independently, different projects inevitably end up using different formats and recovery mechanisms.

Many Python-based solutions persist the last synchronized primary key, update timestamp, Kafka offset, or file position. Although this approach works for an individual project, state management becomes increasingly fragmented as more synchronization jobs are introduced, making recovery logic difficult to standardize across the platform.

SeaTunnel treats Checkpoint and State as core runtime capabilities rather than application logic. During execution, the framework periodically persists the execution state. If a task fails, it can automatically recover from the latest valid checkpoint and continue processing without requiring developers to manually maintain offsets, timestamps, or synchronization markers.

The value of this unified state management extends beyond reducing code. More importantly, it ensures that every synchronization task follows the same recovery mechanism. State becomes a platform-managed resource instead of being scattered across databases, local files, or Redis instances.

Retry and Fault Tolerance: Recovery Should Be a Platform Responsibility

Almost every Python script contains similar logic:

Catch an exception, wait for a few seconds, and try again.

As systems grow more complex, retry logic becomes increasingly sophisticated. Some tasks retry three times before failing, others retry indefinitely, while some require manual intervention after specific errors. Different developers inevitably implement different fault tolerance strategies.

When an organization operates hundreds of synchronization jobs, these inconsistencies eventually translate into operational complexity.

SeaTunnel incorporates Retry, Failover, and Fault Tolerance directly into the Runtime. When a task encounters an exception, the framework determines whether the task should be retried, how it should be recovered, and how it should be rescheduled according to a unified execution policy, rather than leaving these decisions to individual applications.

For developers, this eliminates the need to repeatedly implement exception handling and retry logic in every synchronization job. For platform operators, it ensures that every task follows the same execution behavior and recovery strategy.

This is one of the most important distinctions between a framework and a collection of scripts: fault recovery is a platform capability, not application logic.

Write Semantics: Reading Data Is Easy—Writing It Correctly Is the Real Challenge

One of the most frequently overlooked aspects of data synchronization is not reading data, but ensuring that data is written correctly to the target system.

Consider a task that has written half of its records to the destination database before unexpectedly terminating. If the task restarts and writes those records again, duplicate data may be generated. If it skips the already processed portion incorrectly, data may be lost. To avoid these situations, many Python projects implement their own idempotent write logic, transaction control, or deduplication mechanisms.

These implementations are not only complex but also difficult to standardize across different projects.

SeaTunnel moves write semantics into the Sink Connector and the Runtime, allowing write consistency to be managed in a unified manner. Depending on the capabilities of the target system, different Sink connectors provide appropriate consistency guarantees, while the Runtime coordinates the writing process instead of requiring every synchronization program to implement its own strategy.

As a result, developers only need to specify where the data should be written. The Runtime is responsible for how it is written safely and consistently.

Once data consistency becomes a platform capability, organizations no longer need to maintain multiple implementations of transaction management or idempotent writes. Instead, they rely on a unified write mechanism shared across the entire platform.

Observability: Operating a Platform Instead of Hundreds of Scripts

As the number of synchronization jobs continues to grow, observability becomes an essential capability of any production-grade data platform.

In a traditional environment, every Python script typically has its own logging format, monitoring mechanism, and alerting strategy. When an issue occurs, operations teams often have to inspect individual log files one by one, and in some cases they may not even know which script is responsible for the failure.

SeaTunnel Runtime exposes a unified set of operational metrics, including task status, throughput, latency, resource utilization, checkpoint information, and other runtime indicators. These metrics can be integrated with an organization's existing monitoring infrastructure, providing a consistent operational view across all synchronization jobs.

For operations teams, the focus shifts from managing hundreds of independent programs to operating a single, unified data integration platform.

This transformation goes beyond improving monitoring. It gives the platform a consistent observability model that supports large-scale operations, troubleshooting, and capacity planning.

From Script-Driven Development to Framework-Driven Engineering

Looking back at the evolution of enterprise data platforms, it becomes clear that Python has never been the problem.

In the early stages of a project, Python scripts enable teams to build data ingestion pipelines quickly, reduce development costs, and deliver business value with minimal overhead. They provide an efficient way to establish the first generation of a data platform.

The real challenge emerges as the platform grows.

When the number of synchronization jobs increases from a handful to hundreds, organizations are no longer maintaining only business logic. Instead, they are maintaining hundreds of independent implementations of connection management, concurrency scheduling, state persistence, retry mechanisms, monitoring, and other runtime capabilities.

Apache SeaTunnel is valuable not because it reduces the amount of code developers need to write, but because it redefines the responsibilities of a modern data integration platform.

By introducing a unified Source–Transform–Sink model, SeaTunnel standardizes how data pipelines are described. Through a unified Connector Framework, it abstracts the differences among heterogeneous data systems. More importantly, through a unified Runtime, it elevates capabilities such as parallel execution, checkpointing, state management, retry mechanisms, write semantics, and observability from application logic into platform capabilities.

For developers, creating a new synchronization task no longer means copying an existing Python script and modifying it. Instead, it simply means defining another pipeline.

For enterprises, the maintenance target is no longer hundreds of isolated scripts, but a single, standardized data integration framework.

The number of data synchronization tasks may continue to grow, but the amount of infrastructure that must be repeatedly implemented and maintained is dramatically reduced.

This is the most significant value that Apache SeaTunnel brings.

It is not designed to replace Python. Rather, it allows Python to focus on solving business problems instead of carrying responsibilities that belong to the framework.

As enterprise data platforms continue to evolve toward greater scale, standardization, and engineering maturity, the transition from script-driven development to framework-driven engineering is no longer simply a tooling upgrade—it is a natural evolution of modern data engineering.

Replace four data platforms with one! 🚀 See how Tongcheng Travel unified batch & streaming pipelines using Apache SeaTunnel.

Apache SeaTunnel — Fri, 17 Jul 2026 02:22:44 +0000

Apache SeaTunnel

Jul 17

From Four Platforms to One: How Tongcheng Travel Built a Unified Data Integration Platform with Apache SeaTunnel

#apacheseatunnel #datascience #database #opensource

13 min read

From Four Platforms to One: How Tongcheng Travel Built a Unified Data Integration Platform with Apache SeaTunnel

Apache SeaTunnel — Fri, 17 Jul 2026 02:19:14 +0000

From four independent data integration platforms to one unified batch-stream architecture—discover how Tongcheng Travel leveraged Apache SeaTunnel's Zeta Engine to migrate tens of thousands of production jobs with zero business impact.

For years, Tongcheng Travel's data integration ecosystem evolved into four separate platforms: Data Transfer, Data Integration, Sqoop, and Apache SeaTunnel. While each platform served specific business needs, overlapping functionality, fragmented execution engines, and increasing operational complexity gradually became major challenges.

To address these issues, Tongcheng Travel adopted Apache SeaTunnel Zeta Engine as the unified foundation for its next-generation data integration platform.

This article shares the complete migration journey—from automatically translating legacy Sqoop and FlinkSQL jobs without requiring code changes, to building an AI-powered Data Copilot capable of creating data pipelines using natural language, to designing a large-scale data consistency validation framework that guarantees migration accuracy.

Ultimately, the project successfully consolidated four independent platforms into a unified batch-stream processing architecture while significantly improving operational efficiency, platform maintainability, and future scalability. Looking ahead, the team is continuing to invest in cloud-native architecture and AI-driven intelligent operations.

Meetup Recording

https://youtu.be/b224ISVIU7A?si=eOxLZ1xyUAfdOUVs

About the Author

Xiaochen Zhou is a Data Platform Engineer at Tongcheng Travel and an Apache SeaTunnel Committer. He has been actively contributing to the SeaTunnel community, focusing on engine evolution, connector development, and large-scale enterprise adoption.

Building a Unified Data Integration Platform

Existing Architecture

Tongcheng Travel previously operated four independent data integration services:

Data Transfer
Data Integration
Sqoop
Apache SeaTunnel

Although these systems were built for different scenarios over time, they gradually evolved to provide overlapping capabilities, resulting in duplicated functionality, inconsistent architectures, and increasing maintenance costs.

Specifically:

Data Transfer and Sqoop were built on Flink 1.6 and MapReduce, respectively. They mainly handled offline synchronization between relational databases and big data systems. However, when processing large-scale database synchronization jobs, these engines generated significant pressure on production databases, affecting the stability of online business systems.
Data Integration, based on Apache Flink, focused on real-time synchronization scenarios, primarily moving operational database data into the company's data lake.
Meanwhile, the company's healthcare business had already adopted Apache SeaTunnel Zeta Engine. Thanks to its lightweight architecture designed specifically for data integration workloads, SeaTunnel demonstrated superior resource efficiency and synchronization performance.

Overall, maintaining four different data integration platforms had become increasingly unsustainable.

The team decided to consolidate all existing services into a unified data integration platform powered entirely by Apache SeaTunnel, providing a single engine capable of handling both batch and streaming workloads.

Design Goals and Guiding Principles

To ensure the migration could be completed safely and efficiently, Tongcheng Travel established three core principles.

Zero Business Impact

The migration had to be completely transparent to application teams.

Historical jobs should continue running without requiring developers to rewrite existing scripts or learn a new execution model.

Before the final cutover, the platform supports a dual-run validation strategy, allowing both the legacy engine and Apache SeaTunnel to execute the same job simultaneously. Once validation is complete, workloads can be switched seamlessly without any business-side changes.

Guaranteed Data Consistency

Data quality is the foundation of every data integration platform.

A comprehensive validation and fallback mechanism must ensure that migrated jobs produce results identical to those generated by the legacy engines, guaranteeing consistency before, during, and after migration.

Better Performance and Stability

Legacy MapReduce-based pipelines were heavyweight, while Flink 1.6 introduced significant load on production databases and lacked many modern optimizations.

By standardizing on Apache SeaTunnel Zeta Engine, Tongcheng Travel aimed to leverage its lightweight execution model and high-performance architecture to improve synchronization efficiency while reducing infrastructure costs.

Automatic Task Migration

After defining the overall architecture, the team's biggest challenge became clear:

How can tens of thousands of production jobs be migrated from Sqoop and FlinkSQL to Apache SeaTunnel without requiring developers to modify a single line of code?

Migrating Sqoop and FlinkSQL to Apache SeaTunnel

For most organizations, upgrading the underlying data integration engine usually requires significant cooperation from application teams.

Tongcheng Travel took a completely different approach.

Their guiding principle was:

Zero business awareness. Zero code changes.

Existing Sqoop scripts and FlinkSQL jobs continue to be submitted exactly as before.

Behind the scenes, the platform transparently converts them into standard Apache SeaTunnel jobs and executes them on the SeaTunnel engine.

Skill Layer

The first layer of the architecture consists of a collection of intelligent "Skill" modules responsible for understanding different job formats.

Legacy production jobs—including tens of thousands of Sqoop scripts and FlinkSQL statements—remain completely unchanged.

Instead, specialized translation components interpret each job type:

Sqoop Skill
FlinkSQL Skill
SeaTunnel Skill

These Skills recognize syntax, configuration parameters, execution semantics, and connector definitions before mapping them into a standard Apache SeaTunnel configuration.

This abstraction layer isolates application developers from changes in the underlying execution engine.

Execution Layer

After translation is complete, the unified submission service generates a standard SeaTunnel job configuration.

To guarantee migration safety, Tongcheng Travel introduced a dual-run execution mechanism.

Whenever a workflow is triggered, the platform actually launches two independent jobs.

Production Baseline Job

The original Sqoop or FlinkSQL task continues running exactly as before.

This guarantees that existing production data delivery remains completely unaffected throughout the migration process.

Apache SeaTunnel Canary Job

At the same time, the translated Apache SeaTunnel job is submitted to a separate execution cluster.

Instead of writing into production storage, the output is written into a validation environment for comparison.

Intelligent Validation and Seamless Cutover

Once both jobs complete, an automated validation framework compares their outputs.

If every validation rule passes successfully, future executions of that workload are automatically routed to Apache SeaTunnel.

From the application's perspective, nothing changes.

The migration happens entirely behind the scenes.

From Natural Language to Apache SeaTunnel

Migrating historical workloads solves yesterday's problems.

The next challenge is enabling future development to become dramatically easier.

Tongcheng Travel wanted business users—not just experienced data engineers—to create data synchronization tasks without needing to understand connectors, SQL dialects, or execution engines.

Their answer is Data Copilot, an AI-powered pipeline generation framework built on large language models.

Intelligent Information Completion

Real-world user requests are rarely complete.

A request such as:

"Synchronize yesterday's core product sales data into the data warehouse."

contains insufficient information for execution.

It doesn't specify:

the source database
the destination table
partition strategy
synchronization mode
connector configuration
execution parallelism

To bridge this gap, Tongcheng Travel implemented an intelligent information completion framework.

The system first extracts the user's intent.

It then combines:

historical user behavior,
enterprise metadata,
business semantics,
existing synchronization patterns,

to automatically infer the missing information.

Finally, the system selects the appropriate Source Connector (for example, the MySQL Core Product table) and Sink Connector (such as Hive), while the SeaTunnel Skill automatically generates low-level execution parameters, including partitioning strategy, parallelism, and connector configuration.

The result is a fully executable Apache SeaTunnel job generated entirely from natural language.

Text2SQL: From Natural Language to SQL

Building complete data pipelines from natural language requires more than simply understanding user intent. The platform must also generate accurate, executable SQL statements that can be seamlessly embedded into Apache SeaTunnel jobs.

To achieve this, Tongcheng Travel designed a multi-stage Text2SQL architecture consisting of Schema Linking, Candidate SQL Generation, and Candidate Selection.

Schema Linking

Schema Linking identifies references to database schemas, tables, columns, and filtering conditions within a user's natural language request.

This step significantly improves both cross-domain generalization and the accuracy of complex SQL generation, making it a foundational component of nearly every modern Text2SQL system.

For example, when a user asks, "Synchronize yesterday's hotel orders," the system automatically identifies the corresponding business tables, date fields, and filtering conditions based on enterprise metadata and schema information.

Candidate SQL Generation

Rather than relying on the output of a single large language model, Tongcheng Travel adopts a multi-generator strategy to maximize SQL generation quality and reliability.

Three independent generators work in parallel.

Reasoning Generator

The Reasoning Generator leverages the zero-shot reasoning capabilities of large language models to generate SQL directly from the user's intent and the underlying database schema.

This approach performs well when encountering entirely new business scenarios without requiring historical examples.

In-Context Learning (ICL) Generator

The ICL Generator adopts a Few-shot Learning strategy.

Instead of generating SQL from scratch, it retrieves previously executed high-quality SQL statements that closely resemble the current request and provides them as examples to the language model.

This significantly improves generation accuracy for recurring business scenarios.

Divide-and-Conquer Generator

Some enterprise SQL statements span hundreds or even thousands of lines and include numerous nested subqueries.

Instead of generating one massive SQL statement, the Divide-and-Conquer Generator decomposes the query into multiple Common Table Expressions (CTEs), generates each subquery independently, and finally assembles them into a complete SQL statement.

This approach dramatically improves generation quality for highly complex analytical queries.

Candidate SQL Selection

Once multiple candidate SQL statements have been generated, they undergo an intelligent evaluation process.

Each candidate is scored based on several dimensions, including:

SQL syntax correctness
Abstract Syntax Tree (AST) complexity
Estimated execution cost
Semantic consistency with the original request

Finally, a dedicated SQL Selector chooses the optimal SQL statement and embeds it directly into the Apache SeaTunnel Transform stage.

This multi-path generation strategy substantially improves both SQL accuracy and execution stability compared to traditional single-model approaches.

Text2ETL: Generating Data Pipelines from Natural Language

Beyond SQL generation, Tongcheng Travel also enables complete ETL workflows to be created through natural language.

Interactive Preview

After processing a user's request, the platform immediately generates a preview of the transformed dataset.

Users can verify whether the generated pipeline matches their expectations before execution, significantly reducing trial-and-error during development.

Intelligent Transform Routing

For standard data transformation requirements, the system automatically maps user intent to Apache SeaTunnel's native Transform plugins, including operations such as:

Filter
Replace
Split
Additional built-in Transform components

Whenever possible, the platform prioritizes native SeaTunnel capabilities to maximize execution performance and maintainability.

Dynamic Compilation as a Fallback

Some enterprise data cleansing scenarios are too complex to be expressed using combinations of built-in Transform operators.

Examples include sophisticated data quality rules, custom parsing logic, or highly specialized business transformations.

In these situations, the platform automatically generates Java, Scala, or Groovy code implementing a User Defined Function (UDF).

The generated code is dynamically compiled in the background and loaded into the Apache SeaTunnel runtime as a plugin without requiring manual packaging or deployment.

This mechanism allows developers to benefit from AI-generated pipeline creation while preserving the flexibility needed for complex enterprise workloads.

Data Consistency Validation During Migration

Automatically converting jobs is only the first step in a successful migration.

The true challenge lies in ensuring that both the legacy engine and Apache SeaTunnel produce exactly the same results.

Because different execution engines may implement serialization formats, timestamp handling, numeric precision, and connector behavior differently, guaranteeing zero data loss and zero data deviation becomes the most critical requirement of the entire migration project.

Tongcheng Travel therefore designed a comprehensive validation framework capable of verifying large-scale production datasets efficiently.

Validation for File-Based Engines

Unlike relational databases, file systems do not provide built-in query capabilities.

Loading terabytes of files into memory for full comparison would be both time-consuming and extremely memory-intensive, potentially leading to OutOfMemory (OOM) errors.

To address this challenge, Tongcheng Travel designed a highly efficient comparison framework based on side indexes and multi-level filtering.

Step 1. Data Reading and Normalization

The system concurrently scans both output directories:

Legacy XData/Sqoop output (baseline)
Apache SeaTunnel Zeta output (candidate)

Each record is read line by line and first normalized to eliminate formatting differences.

After normalization, the entire row is hashed using a deterministic hash algorithm.

Step 2. Building Side Indexes

To quickly locate discrepancies without rescanning original files, the platform builds an ordered external index while computing hash values.

The side index consists of two layers.

File-Level Index

The header records the mapping between a logical file index and its physical storage location.

For example:

fileIndex = 0 → .../part-00000

Record-Level Index

Each normalized record generates an index entry containing:

[hash, fileIndex, lineNumber]

These index entries are globally sorted by hash value using external sorting algorithms.

This preprocessing lays the foundation for highly efficient large-scale comparison.

Step 3. Multi-Level Comparison Algorithm

Instead of comparing every record directly, the validation framework performs comparison in multiple stages.

Level 1: Bloom Filter Screening

Each dataset first generates an independent Bloom Filter.

BloomFilter(XData)

BloomFilter(Zeta)

Because Bloom Filters require very little memory while providing extremely fast lookup performance, they quickly eliminate the vast majority of matching records.

Only potentially inconsistent data proceeds to the next stage.

Level 2: Two-Pointer Comparison

For records that cannot be confirmed through Bloom Filter screening, the system compares the two sorted hash indexes using a classic two-pointer algorithm.

If:

h1 < h2

the baseline dataset contains records missing from Apache SeaTunnel.

The pointer for the baseline dataset advances.

If:

h1 > h2

Apache SeaTunnel contains additional records.

The pointer for the candidate dataset advances.

If:

h1 == h2

the framework performs an additional comparison of the occurrence count for that hash value.

This effectively handles duplicate records while maintaining comparison accuracy.

Step 4. Difference Localization

Whenever inconsistencies are detected, the framework performs a binary search on the side index to rapidly locate the offending hash value.

Using the corresponding:

fileIndex
lineNumber

developers can immediately identify the exact row in the original source file where corruption, truncation, or encoding issues occurred.

This reduces troubleshooting time from hours to seconds.

Step 5. Automatic Cutover

Only after every validation rule passes successfully across the complete dataset does the platform perform the final production cutover, routing future executions entirely to Apache SeaTunnel.

Validation for Database Engines

For relational databases and data warehouses, Tongcheng Travel takes advantage of the database engine itself to calculate data fingerprints, eliminating the need to export large datasets for comparison.

Instead of comparing records one by one, the platform computes hash-based feature values for every column and compares the aggregated results between the legacy engine and Apache SeaTunnel.

SELECT
  sum(murmur_hash3_32(coalesce(CAST(order_serial_no AS string), 'NULL'))) AS order_serial_no_hash,
  sum(murmur_hash3_32(coalesce(CAST(platform_code AS string), 'NULL'))) AS platform_code_hash,
  ...
FROM ...

By comparing the hash signatures of each column, the platform can quickly determine whether two datasets are identical with minimal computational overhead.

Compared with traditional full-table comparison, this approach dramatically improves validation efficiency while maintaining high accuracy, making it well suited for large-scale production migrations.

Improving Performance and Stability

Data consistency guarantees migration correctness, but long-term platform success also depends on execution efficiency and operational stability.

To maximize the performance of the unified data integration platform, Tongcheng Travel optimized Apache SeaTunnel across three key areas:

Connector enhancements
Intelligent parallelism estimation
Improved observability

Connector Enhancements

Since connectors serve as the bridge between Apache SeaTunnel and enterprise data systems, improving connector reliability directly improves overall platform stability.

The team contributed a series of enhancements covering multiple ecosystems.

Enhanced Paimon Connector

The Paimon connector received significant improvements, including:

Full support for the TIME data type
Predicate pushdown for LIKE and BETWEEN operations, reducing unnecessary data scanning
Dynamic table option discovery
Branch writing support for Sink operations
Fixes for DECIMAL precision loss
Resolution of missing fields during DataType conversion
Parallel reading from multiple Paimon Source tables

These enhancements significantly improve both compatibility and execution efficiency for Paimon workloads.

High Availability for OLAP Connectors

The StarRocks and Doris connectors were enhanced to improve availability and reliability.

For StarRocks:

Added Frontend (FE) High Availability support
Randomized FE endpoint selection to eliminate single points of failure

For Doris:

Fixed data loss issues occurring when

request_table_size < BUCKETS

These improvements greatly increase connector resilience in production environments.

Improved Streaming Stability

Several optimizations were introduced for streaming connectors.

For Kafka:

Added partition filtering in Stream Mode to prevent consumer blocking

For RocketMQ:

Added support for skipping malformed records instead of terminating the entire synchronization job

For Milvus:

Fixed missing partition-level Load State validation

These changes improve fault tolerance without sacrificing throughput.

HBase Improvements

The HBase Source connector now supports row-range boundary queries, allowing more efficient incremental extraction and partitioned reads.

HDFS ViewFs Compatibility

Apache SeaTunnel now supports the ViewFs filesystem schema, enabling seamless access to HDFS deployments spanning multiple namespace mount points.

This enhancement improves compatibility with large enterprise Hadoop clusters.

Intelligent Parallelism Estimation

Inspired by Apache Flink's Adaptive Batch Scheduler, Tongcheng Travel implemented an Intelligent Parallelism Estimation mechanism for Apache SeaTunnel Zeta Engine.

Before a job is submitted, the engine performs lightweight metadata analysis to estimate the actual workload.

Depending on the source type, it collects information such as:

Total table row count
Directory size
Kafka partition count
Paimon bucket distribution
Physical storage topology

These physical characteristics are then combined with runtime information, including:

Available cluster resources
Destination throughput limits
Current cluster workload
Resource utilization

Using these inputs, the scheduler dynamically calculates the optimal degree of parallelism and determines the most efficient task partitioning strategy.

Rather than relying on static configuration, computing resources are allocated on demand, allowing Apache SeaTunnel to align task execution with the underlying storage topology.

The result is higher resource utilization, shorter execution times, and improved overall cluster efficiency.

Improved Observability

As the unified platform continued to grow, comprehensive observability became increasingly important.

Tongcheng Travel therefore introduced several monitoring capabilities to improve operational visibility.

Checkpoint Monitoring

Inspired by Apache Flink's mature checkpoint monitoring system, the team built a comprehensive observability framework for Apache SeaTunnel Checkpoints.

Key metrics include:

End-to-end Checkpoint duration
State size
Checkpoint failure frequency
Checkpoint success rate

These metrics help engineers quickly identify bottlenecks affecting fault tolerance and recovery performance.

Master Election Metrics

Borrowing ideas from Apache Kafka's controller election monitoring, Tongcheng Travel also implemented detailed monitoring for Apache SeaTunnel Master elections.

The platform continuously tracks:

Election duration
Master switch frequency
Active Master availability
Election success rate

When abnormal conditions occur—such as split-brain scenarios or frequent Master failovers—the monitoring system immediately generates alerts, helping engineers diagnose underlying network issues or cluster resource bottlenecks before they impact production workloads.

Future Roadmap

With the unified data integration platform now running reliably in production, Tongcheng Travel is focusing on the next stage of its evolution.

Over the next one to two years, development will center around two strategic directions:

Cloud-Native Architecture
AI-Powered Data Integration

Cloud-Native Architecture

The team plans to fully embrace Kubernetes-native resource scheduling.

Future enhancements include:

On-demand Worker provisioning
Automatic Worker lifecycle management
Elastic scaling based on workload
Dynamic resource allocation
Remote Job Log storage following cloud-native logging best practices

These capabilities will further improve infrastructure efficiency while reducing operational overhead.

AI-Powered Data Integration

Artificial intelligence will become a core component of the next-generation data platform.

The long-term vision is to build a closed-loop intelligent operations system capable of:

Generating complete synchronization jobs from natural language
Automatically identifying root causes of failed jobs
Optimizing parallelism based on runtime performance and data latency
Continuously tuning execution strategies without manual intervention

The goal is to transform data integration from a manually operated platform into a self-healing, self-optimizing system powered by AI.

Final Thoughts

A unified data platform is more than an engine upgrade—it's a foundation for future innovation. With Apache SeaTunnel at its core, Tongcheng Travel is building a cloud-native, AI-powered data integration platform ready for the next generation of enterprise data engineering.

🚀 Apache SeaTunnel merged 125 PRs in June! From Control Plane evolution and StateStore abstraction to BigQuery, MQTT & Calcite SQL, the platform keeps getting smarter and stronger. 💪

Apache SeaTunnel — Fri, 10 Jul 2026 02:57:20 +0000

Apache SeaTunnel

Jul 10

Apache SeaTunnel June 2026 Roundup: Control Plane Evolution, Smarter Governance, and a Growing Connector Ecosystem

#apacheseatunnel #opensource #datascience #data

13 min read

Apache SeaTunnel June 2026 Roundup: Control Plane Evolution, Smarter Governance, and a Growing Connector Ecosystem

Apache SeaTunnel — Fri, 10 Jul 2026 02:56:52 +0000

Hi, SeaTunnel Community!

The Apache SeaTunnel June 2026 monthly roundup is here. Throughout June, the project merged 125 pull requests, a significant increase from 87 PRs in May.

More importantly, June wasn't simply about fixing bugs or adding new connectors. Development across the project clearly converged around three major strategic directions:

The control plane and engine abstractions continued to evolve. Features such as table-level fault isolation for multi-table synchronization, the introduction of the StateStore abstraction, task workload monitoring, and reporting of non-terminal job states indicate that Zeta is evolving beyond simply executing jobs toward becoming a platform that is easier to govern, recover, and observe in production.
The Connector-V2 ecosystem continued to expand. New connectors and capabilities—including BigQuery Sink, MQTT Source, Salesforce Source, Vitess CDC, and Google Cloud Bigtable integration—further broadened SeaTunnel's coverage across modern data infrastructure.
The data processing layer became more powerful. Features such as Schema Evolution for file connectors, the Calcite SQL Transform Plugin, and built-in Base64 SQL functions demonstrate that SeaTunnel is no longer focused solely on moving data. It is steadily evolving into a platform capable of performing increasingly sophisticated data transformation and processing.

1. Project Overview

1.1 Development Statistics

Development activity in June can be categorized into four major areas:

New Features: 18
Performance Improvements: 0
Bug Fixes: 40
Architecture Improvements: 67

Compared with May, the most notable change was the substantial increase in architecture-related pull requests.

This signals an important shift in the project's priorities. Rather than focusing solely on shipping new connectors, the community invested heavily in strengthening SeaTunnel's underlying architecture, engineering quality, documentation, stability, and system abstractions. These foundational improvements lay the groundwork for long-term scalability and maintainability.

1.2 Module Distribution

PR distribution across major modules:

Module	PRs
seatunnel-connectors-v2	35
docs	27
seatunnel-engine	22
seatunnel-connectors-v2/connector-cdc	14
seatunnel-e2e	12
seatunnel-api	3
.github/ci	2
Other	10

One particularly noteworthy trend is the significant increase in documentation-related contributions.

Throughout June, the community invested considerable effort into transforming operational knowledge into well-structured documentation. Examples include documentation covering:

Zeta StateStore and recovery mechanisms
CDC production deployment cookbooks
REST API lifecycle documentation
Operational best practices for production environments

While documentation doesn't directly improve throughput or latency, it dramatically enhances SeaTunnel's usability, maintainability, and adoption in enterprise environments. Making operational knowledge explicit lowers the learning curve for new users and enables teams to deploy SeaTunnel more confidently in production.

2. Top Contributors

Rank	GitHub User	Merged PRs	Primary Contribution Categories
1	DanielLeens	27	Features ×1 · Bug Fixes ×5 · Architecture ×21
2	zhangshenghang	26	Features ×1 · Bug Fixes ×11 · Architecture ×14
3	nzw921rx	19	Features ×2 · Bug Fixes ×6 · Architecture ×11
4	davidzollo	7	Bug Fixes ×2 · Architecture ×5
5	dybyte	4	Features ×3 · Architecture ×1
6	yzeng1618	4	Features ×1 · Bug Fixes ×1 · Architecture ×2
7	QuakeWang	4	Bug Fixes ×4
8	JeremyXin	4	Bug Fixes ×2 · Architecture ×2
9	DanielCarter-stack	4	Bug Fixes ×1 · Architecture ×3
10	yuluo-yx	2	Bug Fixes ×2
11	ricky2129	2	Features ×1 · Bug Fixes ×1
12	MyeoungDev	2	Features ×2
13	ss666	2	Features ×1 · Architecture ×1
14	CosmosNi	1	Architecture ×1
15	LeonYoah	1	Bug Fix ×1
16	Marx-Carvalho	1	Architecture ×1
17	hawk9821	1	Bug Fix ×1
18	77amyfly	1	Bug Fix ×1
19	JAEKWANG97	1	Feature ×1
20	NixonWahome	1	Feature ×1
21	GabrielBBaldez	1	Feature ×1
22	niumy0701	1	Architecture ×1
23	15037143579	1	Architecture ×1
24	CloverDew	1	Bug Fix ×1
25	xxzuo	1	Architecture ×1
26	Muktha9491	1	Feature ×1
27	programmerloverun	1	Feature ×1
28	zooo-code	1	Bug Fix ×1
29	zhiliang-wu	1	Feature ×1
30	doyong365	1	Architecture ×1
31	loupipalien	1	Architecture ×1

June also highlighted the growing diversity of the Apache SeaTunnel community.

Contributors from different organizations and regions collaborated across connector development, engine architecture, documentation, testing, and ecosystem improvements. Beyond the number of merged PRs, the sustained focus on architectural evolution demonstrates the community's commitment to building a more robust, scalable, and enterprise-ready data integration platform.

3. Key Technical Highlights

Looking across all the pull requests merged in June, the major technical work can be grouped into four strategic areas.

3.1 Multi-Table Synchronization Enters a New Stage of Reliability

One of the most significant improvements is the introduction of table-level fault isolation for multi-table synchronization. (#10600)

Previously, a failure in a single table could cause the entire synchronization job to fail. With the new fault isolation mechanism, SeaTunnel moves away from an all-or-nothing execution model toward more fine-grained failure handling.

This is an important milestone because multi-table synchronization has become one of SeaTunnel's primary production use cases. As deployments continue to scale, the ability to isolate failures at the table level becomes essential for improving reliability and reducing operational costs.

3.2 File Connectors and SQL Processing Become More Powerful

June also brought substantial improvements to SeaTunnel's data processing capabilities.

The file connector now supports Schema Evolution, enabling runtime handling of schema changes such as column additions, removals, renames, and updates. (#10744)

At the same time, the introduction of the Calcite SQL Transform Plugin provides a far more powerful SQL transformation framework with extensible UDF support. (#11062)

Built-in SQL capabilities were also expanded through new Base64 functions, demonstrating that SQL Transform is evolving beyond simple projection and filtering into a more comprehensive data processing layer. (#11114)

3.3 Zeta Continues Decoupling from Hazelcast IMap

Another major architectural milestone is the introduction of the new StateStore abstraction. (#10812)

Historically, parts of the engine relied directly on Hazelcast's IMap implementation for state management. The new abstraction layer removes this tight coupling by introducing generic state storage interfaces.

This architectural evolution opens the door for future support of alternative state backends, hybrid storage strategies, and more flexible runtime implementations.

3.4 Connector Expansion Continues with a Stronger Focus on Production Readiness

The Connector-V2 ecosystem continued to grow throughout June.

New connectors and integrations include:

BigQuery Sink
MQTT Source
Salesforce Source
Vitess CDC
Google Cloud Bigtable Source
Google Cloud Bigtable Sink

Alongside these additions, the community delivered numerous stability improvements across existing connectors and infrastructure, including fixes for:

Apache Paimon
CDC connectors
JdbcHive integration tests
UI NaN display issues
End-to-End testing stability
JDK 8 compatibility

Rather than simply increasing the number of supported systems, the project is placing greater emphasis on connector maturity, compatibility, and production readiness.

4. In-Depth Analysis of Major Technical Changes

Among all the work completed in June, five pull requests best represent the project's technical direction. Below, we examine each of them from three perspectives:

Background
Implementation
Impact

4.1 PR #10600 — Table-Level Fault Isolation: From Global Failure to Granular Recovery

PR Size

23 files changed

1,648 insertions

144 deletions

Background

Multi-table synchronization has become one of Apache SeaTunnel's core production scenarios.

Previously, if a single table encountered a write failure, the entire synchronization job would often fail.

In large-scale production environments, this creates a classic "long-tail" problem:

Imagine synchronizing 100 tables. If 99 complete successfully while just one fails, the entire job must be rerun, wasting considerable time and computing resources.

As deployments continue to grow, this all-or-nothing failure model becomes increasingly expensive.

Core Design

This pull request introduces new abstractions, including:

MultiTableCommonOptions
MultiTableFailurePolicy

These abstractions expose failure handling as an explicit configuration rather than hiding it inside runtime logic.

Key implementation:

public class MultiTableCommonOptions {

    @Experimental
    public static final Option<MultiTableFailurePolicy> MULTI_TABLE_FAILURE_POLICY =
            Options.key("multi_table.failure_policy")
                    .enumType(MultiTableFailurePolicy.class)

How It Works

Instead of treating every failure as a job-ending event, the runtime can now make decisions based on configurable policies.

Possible behaviors include:

Recording failed tables
Continuing synchronization for healthy tables
Preserving failure context for later recovery

This transforms failure handling from a single execution path into policy-driven runtime behavior.

Impact

Connector Layer

Multi-table sink implementations can now understand failure policies and receive table-level failure metadata.

Engine Layer

The coordinator propagates failure context throughout the job lifecycle, enabling more intelligent recovery behavior.

Users

For production workloads involving hundreds or thousands of similarly structured tables, this significantly reduces rerun costs and improves operational efficiency.

This feature represents a major enhancement to SeaTunnel's production governance capabilities.

4.2 PR #10744 — Schema Evolution for File Connectors

PR Size

4 files changed

168 insertions

2 deletions

Background

Traditional file-based sinks generally have poor tolerance for schema changes.

In CDC pipelines, DDL operations such as adding or renaming columns frequently cause downstream writes to fail or produce unreadable files.

By introducing Schema Evolution into the file connector framework, SeaTunnel enables file-based pipelines to adapt to schema changes automatically instead of treating files as static outputs.

Core Design

A new configuration option has been added:

public static final Option<Boolean> SCHEMA_EVOLUTION_ENABLED =
        Options.key("schema_evolution_enabled")
                .booleanType()
                .defaultValue(false)

When enabled, runtime DDL events—including:

ADD COLUMN
DROP COLUMN
RENAME COLUMN
UPDATE COLUMN

are propagated directly to the file sink.

Rather than mixing incompatible schemas into a single file, SeaTunnel automatically rotates output files whenever a schema boundary is detected.

How It Works

Schema evolution is introduced as an explicit runtime capability through the schema_evolution_enabled option.

Whenever the CDC source emits an ALTER TABLE event, the file sink:

Detects the schema change.
Finalizes the current output file.
Creates a new file using the updated schema.
Continues processing without interrupting the pipeline.

This strategy preserves schema consistency while allowing long-running CDC jobs to continue uninterrupted.

Impact

This feature delivers the greatest value for CDC-to-file scenarios, such as:

MySQL CDC → Parquet
MySQL CDC → ORC

It also makes downstream analytics significantly easier, since each file now corresponds to a single schema version rather than containing mixed record formats.

More broadly, this enhancement brings file connectors much closer to the schema-aware behavior typically associated with modern streaming data platforms.

4.3 PR #11062: Calcite SQL Transform Plugin Brings a More Powerful Transformation Layer to SeaTunnel

PR Size: 40 files changed, 8,257 insertions(+), 1 deletion(-)

This is one of the largest feature contributions merged in June.

Background

SeaTunnel has long provided transformation capabilities between sources and sinks. However, previous transform operators primarily focused on predefined functionality, making it difficult to express complex business logic or extend SQL capabilities through reusable functions.

As enterprise data pipelines become increasingly sophisticated, users expect a declarative SQL layer capable of handling rich transformations, custom business logic, and extensible function libraries without writing additional processing code.

The introduction of the Calcite SQL Transform Plugin represents a significant step toward that goal.

Rather than adding another transform operator, this PR introduces a unified SQL processing layer powered by Apache Calcite, allowing users to perform more sophisticated data transformations using familiar SQL syntax.

Core Design

At the heart of this implementation is the new CalciteUdf SPI, which defines a standard extension mechanism for user-defined functions.

/**
 * SPI for Calcite SQL transform UDFs. Implementations must provide a public static
 * eval method whose signature determines the SQL function's input/output types.
 */
public interface CalciteUdf

The SPI specifies that every UDF must expose a public static eval() method, allowing Calcite's code generation engine to invoke functions directly without creating object instances.

This design minimizes runtime overhead while providing a clean and standardized extension mechanism.

In addition to the SPI itself, the PR also includes comprehensive documentation, end-to-end tests, sample configurations, and supporting infrastructure, making the feature production-ready rather than an experimental prototype.

How It Works

Instead of hardcoding transformation logic into the engine, SeaTunnel now delegates SQL parsing, optimization, and execution to Apache Calcite.

The plugin architecture enables users to:

Write more expressive SQL transformations
Register custom UDFs through the SPI
Extend built-in SQL functions without modifying the engine
Share reusable function libraries across projects

By separating SQL execution from connector logic, SeaTunnel establishes a cleaner architecture in which data movement and data transformation evolve independently.

Impact

For users, this dramatically improves the expressiveness of SQL-based data processing. Complex transformations that previously required custom Java development can now be implemented directly in SQL.

For organizations, the standardized UDF extension mechanism makes it easier to encapsulate business logic into reusable function libraries, improving maintainability across multiple data pipelines.

More importantly, this plugin lays the foundation for future enhancements, including richer built-in functions, enterprise UDF ecosystems, and more advanced SQL optimization capabilities.

It represents another important step in SeaTunnel's evolution from a data synchronization framework to a comprehensive data integration platform.

4.4 PR #10812: StateStore Abstraction Marks a Key Architectural Milestone for Zeta

PR Size: 28 files changed, 2,071 insertions(+), 92 deletions(-)

Background

As distributed execution engines evolve, state management becomes one of the most critical architectural components.

Previously, portions of the Zeta engine relied directly on Hazelcast's IMap implementation for state storage. While this approach simplified the initial implementation, it also introduced tight coupling between engine logic and a specific storage technology.

Over time, this coupling becomes architectural debt.

Supporting alternative state backends, introducing hybrid storage strategies, or implementing specialized capabilities such as TTL management becomes increasingly difficult when upper-layer components depend directly on Hazelcast APIs.

This PR addresses that challenge by introducing a generic StateStore abstraction.

Core Design

Rather than exposing Hazelcast semantics throughout the engine, the implementation introduces a hierarchy of capability-oriented interfaces.

public interface ExpiringStateStore<K, V> extends StateStore<K, V>

Additional interfaces include:

StateStore
ExpiringStateStore
IterableStateStore
Other capability-specific abstractions

Instead of programming against a concrete storage implementation, upper-layer components now depend only on the capabilities they require.

This follows a classic interface-oriented architecture that significantly improves extensibility.

How It Works

The new abstraction separates storage capabilities into independent interfaces.

For example:

Basic key-value storage is defined by StateStore
TTL-aware storage is represented by ExpiringStateStore
Iterable storage capabilities are provided independently
Additional storage features can be introduced without affecting existing implementations

This modular design prevents Hazelcast-specific concepts from leaking into the rest of the execution engine.

As a result, storage implementations become interchangeable while the engine itself remains largely unchanged.

Impact

Although end users may not notice immediate behavioral differences, this is one of the most strategically important architectural improvements merged in June.

The new abstraction establishes the foundation for future capabilities such as:

Alternative state storage backends
Hybrid storage architectures
Independent state services
More sophisticated TTL management
Distributed counters
Enhanced recovery mechanisms

In other words, this PR is less about adding visible features and more about enabling the next generation of Zeta's architecture.

4.5 PR #10485: BigQuery Sink Further Expands SeaTunnel's Cloud Data Warehouse Ecosystem

PR Size: 40 files changed, 3,169 insertions(+), 1 deletion(-)

Background

As cloud-native data platforms continue to gain adoption, seamless integration with major cloud data warehouses has become increasingly important.

For organizations running workloads on Google Cloud Platform, BigQuery is often the analytical database of choice. Until now, exporting data into BigQuery required additional tooling or custom integrations.

The new BigQuery Sink closes this gap by making BigQuery a first-class destination within the SeaTunnel ecosystem.

Importantly, this PR goes far beyond simply adding another connector.

It includes configuration definitions, serialization logic, error handling, documentation, plugin registration, and integration support, providing a complete production-ready implementation.

Core Design

The connector introduces a dedicated configuration class:

public class BigQuerySinkOptions {

    public static final String IDENTIFIER = "BigQuery";

    public static final Option<String> PROJECT_ID =
            Options.key("project_id")

The configuration allows users to specify essential BigQuery settings such as project information while integrating seamlessly with SeaTunnel's existing connector framework.

How It Works

The connector leverages SeaTunnel's unified Connector-V2 architecture, allowing users to write data into BigQuery using the same configuration model shared across the entire connector ecosystem.

This consistent design minimizes the learning curve while simplifying deployment and maintenance across heterogeneous data platforms.

Impact

The addition of the BigQuery Sink further strengthens SeaTunnel's support for modern cloud-native data architectures.

Organizations operating in multi-cloud or hybrid-cloud environments can now integrate BigQuery into their data pipelines more naturally, reducing the need for custom development and improving interoperability across cloud services.

Beyond the connector itself, this contribution reflects the community's continued investment in expanding SeaTunnel's cloud ecosystem while maintaining a consistent and unified user experience.

5. How Should These Improvements Be Evaluated?

Unlike traditional performance-oriented releases, most of the major changes merged in June were not designed to increase throughput or reduce latency. Instead, they focused on improving governance, reliability, extensibility, compatibility, and operational resilience.

As a result, evaluating these enhancements requires a different set of metrics. Looking only at rows per second or execution latency would fail to capture their real value.

Instead, each feature should be validated according to its intended purpose.

Table-Level Fault Isolation (PR #10600)

For table-level fault isolation, the focus should be on operational efficiency rather than raw performance.

Recommended evaluation metrics include:

The percentage of tables that continue to complete successfully when one or more tables fail
The completeness and accuracy of recorded failure information
The reduction in recovery time and rerun costs
The effectiveness of failure isolation in large-scale multi-table synchronization jobs

These indicators better reflect how the feature improves production reliability.

Schema Evolution for File Connectors (PR #10744)

Schema evolution should be evaluated based on pipeline continuity and data correctness.

Key validation metrics include:

Whether CDC pipelines continue running after DDL changes
Whether output files are correctly rotated at schema boundaries
Whether downstream analytics systems can continuously consume generated files
Whether schema versions remain consistent across file partitions

The primary objective is ensuring uninterrupted data delivery while preserving schema consistency.

Calcite SQL Transform Plugin (PR #11062)

For the SQL transformation framework, evaluation should focus on functionality and extensibility rather than execution speed alone.

Important validation areas include:

Expressiveness of complex SQL transformations
Ease of developing and registering custom UDFs
Success rate of end-to-end SQL execution
Compatibility with existing transformation workflows
Stability of plugin loading and function discovery

These measurements demonstrate whether the new SQL layer can support increasingly sophisticated business scenarios.

StateStore Abstraction (PR #10812)

Since the StateStore abstraction is primarily an architectural enhancement, validation should emphasize compatibility and maintainability.

Recommended evaluation metrics include:

Stability of state recovery workflows
Compatibility across different StateStore implementations
Regression testing results after replacing storage backends
Correctness of capability-oriented interface implementations
Overall engine stability during failover and recovery

Ultimately, the success of this change lies in enabling future evolution without disrupting existing functionality.

Taken together, these improvements highlight an important shift in how SeaTunnel should be evaluated.

Rather than focusing exclusively on throughput benchmarks, June's work demonstrates growing maturity in areas such as reliability testing, disaster recovery, configuration compatibility, end-to-end validation, and production governance. These qualities are often the deciding factors for enterprise adoption.

6. Looking Ahead: Where Is Apache SeaTunnel Heading?

The technical work completed in June reveals a clear direction for the project's evolution.

While connector expansion remains important, the community is increasingly investing in the foundational capabilities required by modern enterprise data platforms.

Evolving Beyond a Connector Framework

SeaTunnel has long been recognized for its broad connector ecosystem. However, recent developments show that the project is expanding well beyond data movement.

Features such as Schema Evolution, the Calcite SQL Transform Plugin, table-level fault isolation, and the StateStore abstraction all extend beyond the scope of individual connectors.

Together, they form the building blocks of a more comprehensive data integration platform capable of handling not only data ingestion, but also transformation, governance, recovery, and lifecycle management.

Strengthening Zeta as an Enterprise Execution Engine

The Zeta engine continues to mature into a more intelligent and manageable execution platform.

Recent improvements—including reporting of non-terminal job states, task workload monitoring, StateStore abstraction, enhanced recovery mechanisms, and expanded operational documentation—demonstrate a growing emphasis on observability, resilience, and operational governance.

These capabilities are essential for organizations running large-scale production workloads, where stability and recoverability are just as important as execution performance.

Documentation Is Becoming a Strategic Asset

Another notable trend is the community's increased investment in documentation.

The high proportion of documentation-related pull requests in June does not simply reflect more written content. Instead, it represents a deliberate effort to capture and share operational knowledge that was previously scattered across code, discussions, and individual experience.

Topics such as StateStore architecture, CDC production best practices, recovery workflows, and REST API lifecycle management are now documented in a more systematic way.

For users, this reduces the learning curve and accelerates production adoption.

For contributors, it establishes a stronger foundation for future collaboration.

For the community as a whole, it makes Apache SeaTunnel more accessible, maintainable, and sustainable.

Final Thoughts

With continued innovation across the engine, connectors, and developer experience, Apache SeaTunnel is becoming more powerful, more reliable, and easier to adopt. Thanks to every contributor who helped shape the project throughout June—we look forward to building an even stronger ecosystem together.

🚀 Stream MySQL CDC to any HTTP API with Apache SeaTunnel! Build real-time integrations without direct DB access. ⚡🌐📦 #ApacheSeaTunnel #CDC #DataEngineering #OpenSource #MySQL

Apache SeaTunnel — Fri, 10 Jul 2026 02:17:53 +0000

Apache SeaTunnel

Jul 10

How to Stream MySQL CDC Data to Any HTTP API with Apache SeaTunnel

#mysql #apacheseatunnel #http #datascience

4 min read

How to Stream MySQL CDC Data to Any HTTP API with Apache SeaTunnel

Apache SeaTunnel — Fri, 10 Jul 2026 02:10:37 +0000

If your target system can't connect directly to your database and only accepts data through HTTP APIs, Apache SeaTunnel has you covered.

In this tutorial, you'll learn how to build a real-time data pipeline that captures MySQL changes and pushes them directly to a web application's HTTP endpoint—no custom synchronization service required.

Prerequisites

Install the Required Connectors

Edit the config/plugin_config file and add the following connectors:

connector-http-base
connector-cdc-mysql

Then install the plugins:

sh bin/install-plugin.sh

Tip: If your network environment prevents automatic downloads, you can manually download the corresponding connector JARs from Maven Central and place them in the connectors/ directory.

Add the MySQL JDBC Driver

Copy mysql-connector-java-8.0.28.jar (or any compatible 8.x version) into the SeaTunnel lib/ directory.

Configure MySQL for CDC

SeaTunnel's MySQL CDC connector reads data from the MySQL binary log (Binlog), so Binlog must be enabled before you begin.

Open your MySQL configuration file (my.cnf or my.ini), add the following settings, and restart MySQL:

[mysqld]
server-id = 1
log-bin = mysql-bin
binlog-format = ROW

Configuration details:

server-id: Must be unique within the replication cluster.
log-bin: Enables MySQL binary logging.
binlog-format=ROW: Required for CDC to capture row-level changes accurately.

Build a Simple HTTP Receiver

Next, let's create a lightweight HTTP service to receive data from SeaTunnel.

Create a file named server.go:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

// Handle all HTTP requests and print the received payload
func handler(w http.ResponseWriter, r *http.Request) {
    defer r.Body.Close()

    body, err := io.ReadAll(r.Body)
    if err != nil {
        msg := fmt.Sprintf("Failed to read request body: %v", err)
        http.Error(w, msg, http.StatusInternalServerError)
        fmt.Println(msg)
        return
    }

    if len(body) > 0 {
        fmt.Printf("Request Body: %s\n", string(body))
        fmt.Printf("Payload Size: %d bytes\n", len(body))
    } else {
        fmt.Println("Empty request body")
    }

    w.WriteHeader(http.StatusOK)
    fmt.Fprintf(w, "Data received successfully. Payload size: %d bytes.", len(body))
}

func main() {
    http.HandleFunc("/", handler)

    fmt.Println("HTTP server started on port 9090...")
    fmt.Println(`Test with: curl -X POST -d '{"test":123}' http://localhost:9090/`)

    if err := http.ListenAndServe(":9090", nil); err != nil {
        log.Fatal("Failed to start server: ", err)
    }
}

Start the service:

go run server.go

Once you see "HTTP server started on port 9090...", your API endpoint is ready to receive data.

Prepare Test Data

Create a sample table in MySQL and insert two initial records:

CREATE TABLE `post` (
  `id` int(11) NOT NULL,
  `content` varchar(50) DEFAULT NULL,
  `author` varchar(50) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

INSERT INTO `post` VALUES (1, 'Getting Started with MySQL', 'Alice');
INSERT INTO `post` VALUES (2, 'Understanding HTTP', 'Bob');

Create the SeaTunnel Job

Create a file named mysqlcdc_http.conf under the job/ directory.

env {
  parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 10000
}

source {
  MySQL-CDC {
    username = "root"
    password = "root"

    table-names = [
      "your_database.post"
    ]

    url = "jdbc:mysql://your-mysql-host:3306/your_database"

    schema-changes.enabled = true
  }
}

sink {
  Http {
    url = "http://your-http-server:9090/"

    headers {
      Accept = "application/json"
      Content-Type = "application/json;charset=utf-8"
    }
  }
}

Before running the job, replace the following with your own environment:

MySQL hostname
Database name
Username and password
HTTP endpoint URL

Start the Streaming Job

From the SeaTunnel installation directory, run:

bin/seatunnel.sh --config job/mysqlcdc_http.conf -m local

SeaTunnel will start capturing MySQL changes and stream them to your HTTP service in real time.

Verify Real-Time Synchronization

After the job starts successfully, you'll immediately see the two existing records being pushed to your HTTP service:

Starting HTTP server on port 9090...

Request Body:
{"id":1,"content":"Getting Started with MySQL","author":"Alice"}
Payload Size: 62 bytes

Request Body:
{"id":2,"content":"Understanding HTTP","author":"Bob"}
Payload Size: 56 bytes

SeaTunnel first performs an initial snapshot, so existing records are synchronized automatically.

Now insert a new record into MySQL:

INSERT INTO `post`
VALUES (3, 'Real-Time Sync with SeaTunnel', 'Charlie');

The HTTP service immediately receives the new data:

Request Body:
{"id":3,"content":"Real-Time Sync with SeaTunnel","author":"Charlie"}

Payload Size: 69 bytes

This confirms that incremental changes are being captured and delivered to your HTTP endpoint in real time.

Conclusion

With Apache SeaTunnel, building a real-time MySQL-to-HTTP data pipeline is straightforward.

Instead of granting downstream systems direct database access, you can stream data securely through standard HTTP APIs. This architecture is especially useful for:

Integrating with third-party SaaS platforms
Connecting internal business systems across departments
Feeding custom web services or microservices
Building event-driven applications
Decoupling data producers from downstream consumers

Whether you're integrating legacy systems, modern web services, or external business platforms, SeaTunnel's CDC and HTTP connectors provide a flexible, scalable, and production-ready solution for real-time data delivery.

Why Fivetran and Airbyte Still Fall Short for Enterprise Data Ingestion

Apache SeaTunnel — Thu, 09 Jul 2026 09:38:06 +0000

In Reddit's r/dataengineering community, discussions about choosing a data ingestion platform never seem to end. Whenever a team is building a modern data stack (MDS) for production, one question inevitably comes up:

Should we choose Fivetran? Airbyte? Or is there a more capable open-source alternative?

After observing countless discussions from data engineers, one thing has become increasingly clear:

No ELT platform can cover every enterprise data ingestion scenario.

However, if you look beyond connector counts and evaluate real-world production environments, the problem runs much deeper. Today's data engineering teams are not simply struggling with missing connectors—they're dealing with architectural limitations, security and compliance requirements, and the reliability challenges of distributed data movement.

The real challenge isn't building another connector.

It's building a data ingestion platform that works reliably at enterprise scale.

Traditional ELT-Based Data Ingestion Is Reaching Its Limits. SeaTunnel Takes a Different Approach.

Across Reddit, experienced data engineers repeatedly describe the same production issues. Whether using commercial SaaS products or open-source ELT tools, the same three pain points continue to surface.

1. Turning Streaming Workloads into Micro-Batches Comes at a High Cost

Many organizations require near real-time synchronization from operational databases such as PostgreSQL or third-party SaaS applications. A 15-minute SLA—or even lower—is becoming the standard rather than the exception.

The challenge：

Traditional ELT platforms are fundamentally built around scheduled micro-batch execution.

To satisfy a 15-minute SLA, they continuously poll databases or repeatedly call APIs. As synchronization frequency increases, organizations often encounter:

Higher CPU utilization on source databases
Expensive API rate-limit penalties (for example, Salesforce)
Long-running historical synchronization jobs
Endless retry loops during large backfills

SeaTunnel's Batch-Stream Integrated Architecture

Instead of forcing streaming scenarios into micro-batches, SeaTunnel provides a unified Batch-Stream Integration API at the ingestion layer.

For production databases such as PostgreSQL, a single ingestion job can seamlessly combine historical loading and CDC streaming.

Automatic Full + Incremental Synchronization

When a job starts:

SeaTunnel first performs a high-speed batch snapshot of historical data.
Once the snapshot completes, it automatically records the latest log position.
The pipeline immediately switches to CDC mode by continuously consuming PostgreSQL WAL or MySQL Binlog changes.

No manual intervention.

No duplicated data.

No synchronization gap.

Eliminating Source Database Pressure

Once incremental synchronization begins, SeaTunnel reads directly from database logs instead of repeatedly scanning tables.

The result is:

Millisecond-level latency instead of 15-minute polling cycles
Minimal CPU overhead on production databases
No excessive API requests
No unnecessary polling traffic

2. PII Should Never Travel Through Your Pipeline Unprotected

Fivetran follows a strict ELT philosophy:

Load first. Transform later.

That means personally identifiable information (PII) is first copied into the warehouse before tools like dbt perform masking or transformations.

For highly regulated industries—including finance, healthcare, and government—this creates an immediate compliance problem.

The challenge：

Sensitive information reaches the warehouse before it is anonymized.

Many organizations are forced to build custom preprocessing services using Python proxies or middleware simply to satisfy internal compliance requirements.

As a result, what should have been a simple ELT deployment becomes increasingly complicated.

SeaTunnel Protects Sensitive Data Before It Lands

As an open-source data ingestion platform, SeaTunnel removes this architectural constraint by introducing lightweight streaming transformations directly into the ingestion pipeline.

In-Flight Data Masking

Before records are written into Snowflake, Iceberg, or any downstream storage, SeaTunnel performs transformations entirely in memory.

Using built-in operators such as:

Replace
Filter
FieldMapper
Custom UDF plugins

Engineers can:

Hash email addresses with SHA-256
Remove sensitive columns
Replace confidential values
Apply enterprise-specific masking logic

Sensitive information is transformed before it ever reaches the destination system.

Compliance is enforced directly inside the ingestion layer rather than being delegated to downstream processing.

That significantly reduces engineering complexity while simplifying security audits.

3. The Long-Tail Connector Trap in Open Source

Airbyte has grown rapidly thanks to its community-driven connector ecosystem.

Connector quantity, however, doesn't necessarily translate into production readiness.

The challenge：

Many long-tail connectors are implemented as lightweight Python wrappers.

They work well for small datasets but often struggle when processing billions of records during historical synchronization.

Without distributed partitioning or parallel extraction capabilities, common production issues include:

Task failures
Pipeline hangs
Memory exhaustion
Constant manual restarts

SeaTunnel Solves This with the Zeta Engine

Rather than stitching together scripting frameworks, SeaTunnel built its own distributed execution engine specifically for data synchronization.

Distributed Data Splitting

Large datasets are automatically partitioned into thousands of parallel splits.

Each split is processed independently across distributed workers, maximizing available bandwidth—even for legacy storage systems.

Checkpointing and Exactly-Once Recovery

Built upon the principles of the Chandy-Lamport distributed snapshot algorithm, the Zeta Engine provides native checkpointing and fault recovery.

If an API rate limit, network interruption, or node failure occurs, SeaTunnel automatically resumes from the last consistent checkpoint.

The result is Exactly-Once data delivery without requiring operators to restart production jobs in the middle of the night.

Reimagining Data Ingestion: How Apache SeaTunnel Builds a Unified Data Movement Layer

Within the open-source community, SeaTunnel is increasingly recognized as a unified data movement platform.

With approximately 200+ connectors covering databases, messaging systems, files, data lakes, data warehouses, search engines, and vector databases, it provides a single platform for enterprise-scale data movement.

Its architecture is fundamentally different from traditional ELT tools.

Instead of optimizing individual connectors, SeaTunnel redesigns the entire data ingestion layer.

1. Unified Abstraction Through SeaTunnelRow

Connecting hundreds of heterogeneous systems traditionally creates an M×N integration problem.

SeaTunnel solves this by introducing a unified internal data model called SeaTunnelRow.

Whether the source contains:

MySQL VARCHAR fields
Elasticsearch Objects
Milvus FloatVectors

all data is converted into a standardized internal representation before processing.

The advantage

Developers only need to build a Source connector once.

That connector can immediately work with every existing Sink supported by SeaTunnel.

This architecture enables rapid support for both legacy enterprise systems and emerging AI infrastructure without rebuilding the entire connector ecosystem.

2. A Distributed Engine Built Specifically for Data Movement

Historically, organizations relied on Spark or Flink to achieve high-performance synchronization.

While powerful, those engines also introduce significant operational complexity through Kubernetes, YARN, and large distributed clusters.

SeaTunnel approaches the problem differently.

Its distributed execution engine —— Zeta Engine is purpose-built for data synchronization.

Key capabilities include:

Distributed Data Splitting

Large datasets are automatically divided into thousands of independent splits executed by distributed TaskGroups, maximizing throughput across databases, files, and cloud storage.

Dynamic Thread Sharing

Traditional synchronization frameworks often dedicate one thread per task.

At enterprise scale—with thousands of tables—that model quickly wastes CPU resources due to excessive context switching.

SeaTunnel introduces Dynamic Thread Sharing, allowing many synchronization tasks to reuse a shared thread pool and dramatically improve resource utilization.

Native Two-Phase Commit and Distributed Checkpointing

Enterprise-grade data movement requires financial-level consistency.

The Zeta Engine natively implements distributed checkpointing together with Two-Phase Commit (2PC), enabling Exactly-Once guarantees without depending on heavyweight external processing engines.

From Another Connector Tool to Open Data Infrastructure

The conversations happening every day on Reddit reveal an important trend.

The next generation of data integration is no longer about:

Better dashboards
More polished UIs
Or adding a few more SaaS connectors

The real challenge lies inside the enterprise.

Organizations need to move data across legacy systems, modern cloud platforms, streaming infrastructure, data lakes, warehouses, and increasingly, AI-native vector databases—while simultaneously meeting strict compliance requirements and real-time SLAs.

Fivetran and Airbyte have undoubtedly transformed modern ELT and made data integration significantly easier.

But as enterprise architectures become increasingly distributed and real-time, the industry is beginning to demand something beyond traditional ELT.

It needs an open, distributed, batch-and-stream integrated data ingestion layer that is designed for production from the ground up.

With its purpose-built distributed Zeta Engine, native batch-stream integration, approximately 200+ connectors, and support for databases, data lakes, warehouses, messaging systems, files, search platforms, and AI vector databases, Apache SeaTunnel is evolving from a synchronization tool into a foundational layer for modern data infrastructure.

References

Apache SeaTunnel GitHub Repository: https://github.com/apache/seatunnel
Apache SeaTunnel Official Website: https://seatunnel.apache.org/
Download Apache SeaTunnel: https://seatunnel.apache.org/download
WhaleOps Official Website: https://www.whaleops.io/

🔑 No more primary key conflicts! Merge data from multiple MySQL databases seamlessly with Apache SeaTunnel. ⚡ #ApacheSeaTunnel #MySQL #CDC #DataEngineering #OpenSource

Apache SeaTunnel — Thu, 02 Jul 2026 03:29:29 +0000

Apache SeaTunnel

Jul 2

How Apache SeaTunnel Eliminates Primary Key Conflicts When Consolidating Data from Multiple Tables

4 min read

How Apache SeaTunnel Eliminates Primary Key Conflicts When Consolidating Data from Multiple Tables

Apache SeaTunnel — Thu, 02 Jul 2026 03:27:22 +0000

⭐ Star Apache SeaTunnel on GitHub:
https://github.com/apache/seatunnel

Overview

In many real-world scenarios, multiple databases may contain tables with similar data. For example, data from different business systems may be stored in tables with the same schema but located in separate databases.

When these tables need to be consolidated into a single table for reporting and analytics, a common challenge arises: because the source tables share the same primary key design, directly merging the data results in duplicate primary keys.

Apache SeaTunnel provides an elegant solution to this challenge. In this article, we'll walk through how Apache SeaTunnel resolves this issue in a simple and effective way.

Solution Design

This solution is designed to synchronize the test table from two independent MySQL databases (source1 and source2) into the test table of a third database (source3) in real time and with high accuracy.

To support real-time synchronization and capture data changes, the solution uses Change Data Capture (CDC) based on the MySQL Binary Log (Binlog).

To prevent duplicate auto-increment primary key IDs from different source tables, the destination table introduces an additional sources column. This column, together with the original id, forms a composite primary key, ensuring both data uniqueness and traceability.

Prerequisites

MySQL 5.7
Apache SeaTunnel 2.3.12

Verify the MySQL Configuration

First, verify that MySQL Binary Logging (Binlog) is enabled.

mysql> show variables where variable_name in ('log_bin', 'binlog_format', 'binlog_row_image', 'gtid_mode', 'enforce_gtid_consistency');

+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| binlog_format            | ROW   |
| binlog_row_image         | FULL  |
| enforce_gtid_consistency | OFF   |
| gtid_mode                | OFF   |
| log_bin                  | ON    |
+--------------------------+-------+

If the value of log_bin is not ON, update the MySQL configuration file (mysql.cnf) as follows:

log-bin=mysql-bin
server-id=1
binlog_format=ROW
binlog_checksum=NONE

Restart the MySQL service after updating the configuration.

Configure Apache SeaTunnel

1. Install the Required Connectors

Add the following connectors to the ${SEATUNNEL_HOME}/config/plugin_config file:

connector-cdc-mysql
connector-jdbc

Then install the connectors.

sh bin/install-plugin.sh

2. Install the MySQL JDBC Driver

This example uses mysql-connector-java-8.0.28.jar.

Copy the JAR file to the ${SEATUNNEL_HOME}/lib/ directory.

Prepare the Test Data

CREATE DATABASE source1 CHARACTER SET utf8mb4;
CREATE DATABASE source2 CHARACTER SET utf8mb4;
CREATE DATABASE source3 CHARACTER SET utf8mb4;

USE source1;

CREATE TABLE `test` (
  `id` INT(11) NOT NULL AUTO_INCREMENT,
  `name` VARCHAR(50) CHARACTER SET utf8mb4,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 CHARACTER SET=utf8mb4;

INSERT INTO test VALUES (1,'张三');
INSERT INTO test VALUES (2,'李四');

USE source2;

CREATE TABLE `test` (
  `id` INT(11) NOT NULL AUTO_INCREMENT,
  `name` VARCHAR(50) CHARACTER SET utf8mb4,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 CHARACTER SET=utf8mb4;

INSERT INTO test VALUES (1,'王五');
INSERT INTO test VALUES (2,'赵六');

USE source3;

CREATE TABLE `test` (
  `id` INT(11) NOT NULL AUTO_INCREMENT,
  `name` VARCHAR(50) CHARACTER SET utf8mb4,
  `sources` VARCHAR(50) CHARACTER SET utf8mb4,
  PRIMARY KEY (`id`, `sources`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 CHARACTER SET=utf8mb4;

Prepare the SeaTunnel Job

env {
  parallelism = 2
  job.mode = "STREAMING"
  checkpoint.interval = 20000
}

source {

  MySQL-CDC {
    plugin_output = "source_data1"
    url = "jdbc:mysql://10.0.12.100:3306/source1"
    username = "root"
    password = "root"
    table-names = ["source1.test"]
    startup.mode = initial
  }

  MySQL-CDC {
    plugin_output = "source_data2"
    url = "jdbc:mysql://10.0.12.100:3306/source2"
    username = "root"
    password = "root"
    table-names = ["source2.test"]
    startup.mode = initial
  }

}

transform {

  Sql {
    plugin_input = "source_data1"
    plugin_output = "result1"
    query = "SELECT *, 'source1' AS sources FROM source_data1"
  }

  Sql {
    plugin_input = "source_data2"
    plugin_output = "result2"
    query = "SELECT *, 'source2' AS sources FROM source_data2"
  }

}

sink {

  Jdbc {

    plugin_input = ["result1","result2"]

    url = "jdbc:mysql://10.0.12.100:3306/source3"

    driver = "com.mysql.cj.jdbc.Driver"

    username = "root"

    password = "root"

    database = "source3"

    table = "test"

    generate_sink_sql = true

    primary_keys = ["id","sources"]

  }

}

This example uses a single configuration to synchronize data from two databases. Alternatively, you can configure two separate synchronization jobs.

Two source connectors are configured: source_data1 and source_data2.

Two SQL transform components are also configured to process the corresponding source connectors.

Each SQL transform assigns a fixed value to the sources field.

The sink consumes the outputs from both SQL transforms and writes the data to the test table. The composite primary key is configured to support update operations.

The destination table must be created manually. Otherwise, the following error will be reported:

BLOB/TEXT column 'sources' used in key specification without a key length

Test

1. Start the Job

bin/seatunnel.sh --config job/mysql_mysql.conf -m local

2. Verify the Data

mysql> select * from test;

+----+--------+---------+
| id | name   | sources |
+----+--------+---------+
|  1 | 张三   | source1 |
|  1 | 王五   | source2 |
|  2 | 李四   | source1 |
|  2 | 赵六   | source2 |
+----+--------+---------+

3. Modify the Source Data

USE source1;

INSERT INTO test VALUES (3,'钱七');

UPDATE test
SET name='张三1'
WHERE id=1;

USE source2;

DELETE FROM test
WHERE id=1;

4. Verify the Data Again

mysql> SELECT * FROM test;

+----+---------+---------+
| id | name    | sources |
+----+---------+---------+
|  1 | 张三1   | source1 |
|  2 | 李四    | source1 |
|  2 | 赵六    | source2 |
|  3 | 钱七    | source1 |
+----+---------+---------+

Summary

The steps above demonstrate the complete process of synchronizing data from tables in two databases into a single table in another database.

I have seen other articles mentioning that the destination table can be created automatically, but I have not been able to reproduce this behavior during my testing.

If you have successfully implemented automatic destination table creation, feel free to share your experience in the comments.

Deploy Apache SeaTunnel 2.3.11 with Docker: A Complete Guide to Syncing Kafka Data to Hive and Elasticsearch

Apache SeaTunnel — Thu, 02 Jul 2026 03:06:23 +0000

This guide walks you through the complete process of deploying Apache SeaTunnel 2.3.11 with Docker. It covers everything from environment setup and dependency installation to configuring Kafka virtual tables, data sources, and building end-to-end data synchronization pipelines from Kafka to Hive and Elasticsearch.

Prerequisites

Project Directory Structure

seatunnel-docker/
├── docker-compose.yml              # Main Docker Compose configuration
├── hive/                           # Hive configuration
│   ├── hive-site.xml
│   └── lib/                        # Required dependency JARs
│       └── postgresql-42.5.1.jar
├── init-sql/                       # Database initialization scripts
│   └── seatunnel_server_mysql.sql
├── seatunnel/                      # SeaTunnel server configuration
│   ├── Dockerfile
│   └── apache-seatunnel-2.3.11/    # Extracted SeaTunnel binary package
│       └── lib/                    # Required dependency JARs
│           ├── hive-exec-3.1.3.jar
│           ├── hive-metastore-3.1.3.jar
│           ├── libfb303-0.9.3.jar
│           ├── mysql-connector-java-8.0.28.jar
│           └── seatunnel-hadoop3-3.1.4-uber.jar
└── seatunnel-web/                  # SeaTunnel Web configuration
    ├── Dockerfile
    └── apache-seatunnel-web-1.0.3-bin/  # Extracted SeaTunnel Web package
        └── libs/                   # Required dependency JARs
            └── mysql-connector-java-8.0.28.jar

Download Apache SeaTunnel

# Apache SeaTunnel 2.3.11
https://dlcdn.apache.org/seatunnel/2.3.11/apache-seatunnel-2.3.11-bin.tar.gz

# Build SeaTunnel Web 1.0.3 from source
git clone https://github.com/apache/seatunnel-web.git
cd seatunnel-web
sh build.sh code

Download Required Dependencies

# Required by the Hive Metastore container (PostgreSQL is used as the Hive metastore database)
https://jdbc.postgresql.org/download/postgresql-42.5.1.jar

# Additional dependencies required for Hive synchronization
# (In practice, only the first three JARs are required)
https://repo1.maven.org/maven2/org/apache/hive/hive-exec/3.1.3/hive-exec-3.1.3.jar
https://repo1.maven.org/maven2/org/apache/hive/hive-metastore/3.1.3/hive-metastore-3.1.3.jar
https://repo.maven.apache.org/maven2/org/apache/thrift/libfb303/0.9.3/libfb303-0.9.3.jar
https://repo1.maven.org/maven2/org/apache/thrift/libthrift/0.12.0/libthrift-0.12.0.jar
https://repo1.maven.org/maven2/org/apache/hive/hive-common/3.1.3/hive-common-3.1.3.jar

Create the Project Directory

Place all downloaded packages and configuration files into the seatunnel-docker directory.

mkdir seatunnel-docker
cd seatunnel-docker

Deploy with Docker

Configure `docker-compose.yml`

version: '3.9'

networks:
  seatunnel-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24

services:
  # ===== Hive Services =====
  hive-metastore-db:
    image: postgres:15
    container_name: hive-metastore-db
    hostname: hive-metastore-db
    environment:
      POSTGRES_DB: metastore_db
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive123456
    ports:
      - "5432:5432"
    volumes:
      - ./hive-metastore-db-data:/var/lib/postgresql/data
    networks:
      seatunnel-network:
        ipv4_address: 172.16.0.2
    healthcheck:  # Health check
      test: ["CMD-SHELL", "pg_isready -U hive -d metastore_db"]
      interval: 5s
      timeout: 5s
      retries: 10
      start_period: 10s

  hive-metastore:
    image: apache/hive:4.0.0
    container_name: hive-metastore
    hostname: hive-metastore
    depends_on:
      hive-metastore-db:
        condition: service_healthy  # Wait until the database is healthy
    environment:
      SERVICE_NAME: metastore
      DB_DRIVER: postgres
      SERVICE_OPTS: >-
        -Djavax.jdo.option.ConnectionDriverName=org.postgresql.Driver
        -Djavax.jdo.option.ConnectionURL=jdbc:postgresql://hive-metastore-db:5432/metastore_db
        -Djavax.jdo.option.ConnectionUserName=hive
        -Djavax.jdo.option.ConnectionPassword=hive123456
    ports:
      - "9083:9083"
    volumes:
      - ./hive/lib/postgresql-42.5.1.jar:/opt/hive/lib/postgresql-42.5.1.jar
      - ./hive/hive-site.xml:/opt/hive/conf/hive-site.xml
      - ./hive-warehouse:/opt/hive/data/warehouse
    networks:
      seatunnel-network:
        ipv4_address: 172.16.0.3

  hive-server2:
    image: apache/hive:4.0.0
    container_name: hive-server2
    hostname: hive-server2
    depends_on:
      - hive-metastore
    environment:
      HIVE_SERVER2_THRIFT_PORT: 10000
      SERVICE_NAME: hiveserver2
      IS_RESUME: "true"
      SERVICE_OPTS: "-Dhive.metastore.uris=thrift://hive-metastore:9083"
    ports:
      - "10000:10000"
      - "10002:10002"
    volumes:
      - ./hive-warehouse:/opt/hive/data/warehouse
    networks:
      seatunnel-network:
        ipv4_address: 172.16.0.4

  # ===== MySQL =====
  mysql-seatunnel:
    image: mysql:8.0.42
    container_name: mysql-seatunnel
    hostname: mysql-seatunnel
    environment:
      MYSQL_ROOT_PASSWORD: root123456
      MYSQL_DATABASE: seatunnel
      MYSQL_ROOT_HOST: '%'
    ports:
      - "3806:3306"
    volumes:
      - ./mysql_data:/var/lib/mysql
      - ./init-sql:/docker-entrypoint-initdb.d
    networks:
      seatunnel-network:
        ipv4_address: 172.16.0.5
    command: --default-authentication-plugin=mysql_native_password
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 10s
      timeout: 5s
      retries: 5

  # ===== SeaTunnel Services =====
  seatunnel-master:
    build:
      context: ./seatunnel
      dockerfile: Dockerfile
    image: seatunnel:2.3.11
    container_name: seatunnel-master
    hostname: seatunnel-master
    extra_hosts:
      - "hive-metastore:172.16.0.3"
      - "hive-metastore-db:172.16.0.2"
    environment:
      - SEATUNNEL_HOME=/opt/seatunnel
    command: >
      sh -c "
      cd /opt/seatunnel &&
      exec bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
    volumes:
      - ./seatunnel/apache-seatunnel-2.3.11/:/opt/seatunnel/
      - ./logs/master:/opt/seatunnel/logs
      # Mount the Hive warehouse directory to ensure data is persisted on the host
      - ./hive-warehouse:/opt/hive/data/warehouse
    networks:
      seatunnel-network:
        ipv4_address: 172.16.0.10

  seatunnel-worker1:
    image: seatunnel:2.3.11
    container_name: seatunnel-worker1
    hostname: seatunnel-worker1
    extra_hosts:
      - "hive-metastore:172.16.0.3"
      - "hive-metastore-db:172.16.0.2"
    environment:
      - SEATUNNEL_HOME=/opt/seatunnel
    command: >
      sh -c "
      cd /opt/seatunnel &&
      exec bin/seatunnel-cluster.sh -r worker
      "
    volumes:
      - ./seatunnel/apache-seatunnel-2.3.11/:/opt/seatunnel/
      - ./logs/worker1:/opt/seatunnel/logs
      # Mount the Hive warehouse directory to ensure data is persisted on the host
      - ./hive-warehouse:/opt/hive/data/warehouse
    depends_on:
      - seatunnel-master
    networks:
      seatunnel-network:
        ipv4_address: 172.16.0.11

  seatunnel-worker2:
    image: seatunnel:2.3.11
    container_name: seatunnel-worker2
    hostname: seatunnel-worker2
    extra_hosts:
      - "hive-metastore:172.16.0.3"
      - "hive-metastore-db:172.16.0.2"
    environment:
      - SEATUNNEL_HOME=/opt/seatunnel
    command: >
      sh -c "
      cd /opt/seatunnel &&
      exec bin/seatunnel-cluster.sh -r worker
      "
    volumes:
      - ./seatunnel/apache-seatunnel-2.3.11/:/opt/seatunnel/
      - ./logs/worker2:/opt/seatunnel/logs
      # Mount the Hive warehouse directory to ensure data is persisted on the host
      - ./hive-warehouse:/opt/hive/data/warehouse
    depends_on:
      - seatunnel-master
    networks:
      seatunnel-network:
        ipv4_address: 172.16.0.12

  seatunnel-web:
    build:
      context: ./seatunnel-web
      dockerfile: Dockerfile
    image: seatunnel-web:1.0.3
    container_name: seatunnel-web
    hostname: seatunnel-web
    extra_hosts:
      - "hive-metastore:172.16.0.3"
      - "hive-metastore-db:172.16.0.2"
    environment:
      - SEATUNNEL_HOME=/opt/seatunnel
      - SEATUNNEL_WEB_HOME=/opt/seatunnel-web
    ports:
      - "8801:8801"
    volumes:
      - ./seatunnel/apache-seatunnel-2.3.11/:/opt/seatunnel/
      - ./seatunnel-web/apache-seatunnel-web-1.0.3-bin/:/opt/seatunnel-web/
      - ./logs/web:/opt/seatunnel-web/logs
      # Mount the Hive warehouse directory to keep the runtime environment consistent
      - ./hive-warehouse:/opt/hive/data/warehouse
    depends_on:
      - seatunnel-master
    networks:
      seatunnel-network:
        ipv4_address: 172.16.0.13

SeaTunnel Configuration

Dockerfile

Create the following Dockerfile for the SeaTunnel service:

FROM eclipse-temurin:8-jdk-ubi9-minimal

WORKDIR /opt/seatunnel/

# Environment variables
ENV SEATUNNEL_HOME=/opt/seatunnel
ENV PATH=$PATH:$SEATUNNEL_HOME/bin

# Expose the cluster communication port
EXPOSE 5801

# Startup command
CMD ["sh", "bin/seatunnel-cluster.sh", "-r", "master"]

Configure `hazelcast-client.yaml`

Edit:

seatunnel/apache-seatunnel-2.3.11/config/hazelcast-client.yaml

Configure the SeaTunnel client to connect to the cluster:

hazelcast-client:
  cluster-name: seatunnel
  properties:
    hazelcast.logging.type: log4j2
  connection-strategy:
    connection-retry:
      cluster-connect-timeout-millis: 3000
  network:
    cluster-members:
      - seatunnel-master:5801

Configure `hazelcast-master.yaml`

Edit:

seatunnel/apache-seatunnel-2.3.11/config/hazelcast-master.yaml

Configure the master node:

hazelcast:
  cluster-name: seatunnel
  network:
    rest-api:
      enabled: false
      endpoint-groups:
        CLUSTER_WRITE:
          enabled: true
        DATA:
          enabled: true
    join:
      tcp-ip:
        enabled: true
        member-list:
          - seatunnel-master:5801
          - seatunnel-worker1:5802
          - seatunnel-worker2:5802
    port:
      auto-increment: false
      port: 5801
  properties:
    hazelcast.invocation.max.retry.count: 20
    hazelcast.tcp.join.port.try.count: 30
    hazelcast.logging.type: log4j2
    hazelcast.operation.generic.thread.count: 50
    hazelcast.heartbeat.failuredetector.type: phi-accrual
    hazelcast.heartbeat.interval.seconds: 2
    hazelcast.max.no.heartbeat.seconds: 180
    hazelcast.heartbeat.phiaccrual.failuredetector.threshold: 10
    hazelcast.heartbeat.phiaccrual.failuredetector.sample.size: 200
    hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis: 100

Configure `hazelcast-worker.yaml`

Edit:

seatunnel/apache-seatunnel-2.3.11/config/hazelcast-worker.yaml

Configure each worker node:

hazelcast:
  cluster-name: seatunnel
  network:
    join:
      tcp-ip:
        enabled: true
        member-list:
          - seatunnel-master:5801
          - seatunnel-worker1:5802
          - seatunnel-worker2:5802
    port:
      auto-increment: false
      port: 5802
  properties:
    hazelcast.invocation.max.retry.count: 20
    hazelcast.tcp.join.port.try.count: 30
    hazelcast.logging.type: log4j2
    hazelcast.operation.generic.thread.count: 50
    hazelcast.heartbeat.failuredetector.type: phi-accrual
    hazelcast.heartbeat.interval.seconds: 2
    hazelcast.max.no.heartbeat.seconds: 180
    hazelcast.heartbeat.phiaccrual.failuredetector.threshold: 10
    hazelcast.heartbeat.phiaccrual.failuredetector.sample.size: 200
    hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis: 100

Install Connector Plugins

If no options appear in the Source component when creating a synchronization job, the required connector plugins have not been installed.

Run the following command to install all supported connector plugins:

cd seatunnel/apache-seatunnel-2.3.11/
sh bin/install-plugin.sh

Hive Configuration

Configure `hive-site.xml`

Edit:

hive/hive-site.xml

Update the configuration as follows:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hive-metastore:9083</value>
    </property>

    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/opt/hive/data/warehouse</value>
    </property>

    <property>
        <name>metastore.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
</configuration>

Add Required Dependencies

Place the following JDBC driver in the hive/lib directory.

postgresql-42.5.1.jar

MySQL Configuration

Initialize the Database

Copy the initialization SQL script from the SeaTunnel Web package into the init-sql directory.

cd seatunnel-docker

cp seatunnel-web/apache-seatunnel-web-1.0.3-bin/script/seatunnel_server_mysql.sql \
init-sql/seatunnel_server_mysql.sql

Start the Docker Environment

Build and start all services:

# Build and start all services
docker compose up -d --build

# Open the SeaTunnel Web UI
# Default credentials:
# Username: admin
# Password: admin
open http://localhost:8801

Running Example

After all services have started successfully, log in to the SeaTunnel Web UI and complete the following configuration steps.

Configure the Display Language

Login Page

Open Settings

Change the Language

Select your preferred language from the language settings.

Configure Data Sources

Before creating synchronization jobs, configure the required data sources.

Configure a Kafka Data Source

Create a Kafka connection by providing the cluster address and connection parameters.

(Insert screenshot)

Configure an Elasticsearch Data Source

Configure your Elasticsearch cluster information, including the endpoint and authentication credentials if required.

Configure a Hive Metastore Local Data Source

You can configure the Hive Metastore endpoint using:

thrift://hive-metastore:9083

Configure Virtual Tables

Create a Virtual Table

Follow these steps to create a virtual table:

Navigate to Virtual Tables.
Click Create.
Select an existing data source.
Configure the virtual table properties.
Click Next to define field mappings.
Review the configuration.
Save the virtual table.

Create Synchronization Jobs

Once the data sources and virtual tables are ready, you can build synchronization pipelines using the visual designer.

Kafka → Hive Synchronization

Configure the Job Components

Source

Configure the Kafka source by selecting the previously created Kafka data source.

Field Mapper

Open the Model view to define field mappings between the source and destination schemas.

Sink

Configure Hive as the destination and specify the target database and table.

Kafka → Elasticsearch Synchronization

Configure the Job Components

Source

Configure the Kafka source.

Field Mapper

Configure field mappings in the Model view.

Sink

Configure Elasticsearch as the destination.

Specify the target index and any required connection parameters.

General Workflow for Creating Synchronization Jobs

To create a synchronization job in SeaTunnel Web:

Navigate to Jobs → Synchronization Job Definitions.
Click Create.
Drag or select the Source, Field Mapper, and Sink components to build the pipeline.
Double-click the Source component and select the configured Kafka data source.
Double-click Field Mapper, then open the Model view to configure field mappings.
Double-click the Sink component and configure Hive or Elasticsearch as the destination.
Save the job.
Start the synchronization job.

Important

Before saving the job, make sure to configure the Job Mode.

Otherwise, the job cannot be saved and the following error will be displayed:
job env can't be empty, please change config

Hive Operations

Create a Table

Use one of the following commands to create a table in Hive.

# Access the HiveServer2 container
docker exec -it hive-server2 beeline -u jdbc:hive2://localhost:10000 -e "
CREATE TABLE IF NOT EXISTS default.test_user_data3 (
user_id STRING,
type STRING,
content STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
"

Alternatively, you can create the table in Parquet format, which is recommended for better storage efficiency and query performance.

docker exec -it hive-server2 beeline -u jdbc:hive2://localhost:10000 -e "
CREATE TABLE IF NOT EXISTS default.test_user_data3 (
user_id STRING,
type STRING,
content STRING
)
STORED AS PARQUET;
"

View the Table Schema

Run the following command to verify that the table has been created successfully.

docker exec -it hive-server2 beeline -u jdbc:hive2://localhost:10000 -e "
SHOW TABLES IN default;
DESCRIBE default.test_user_data3;
"

Query Table Data

Run the following command to query the synchronized data stored in Hive.

docker exec -it hive-server2 beeline -u jdbc:hive2://localhost:10000 -e "
SELECT * FROM default.test_user_data3 LIMIT 10;
"

Troubleshooting

Hive Metastore URI Parsing Error

If the following exception is reported:

seatunnel seatunnel-web ERROR [qtp2135089262-20] [MetaStoreUtils.logAndThrowMetaException():166] - Got exception: java.net.URISyntaxException Illegal character in hostname at index 44: thrift://hive-metastore.seatunnel-docker_seatunnel-network:9083

Add static hostname mappings to the corresponding services in docker-compose.yml.

extra_hosts:
   - "hive-metastore:172.16.0.3"
   - "hive-metastore-db:172.16.0.2"

Hive Synchronization Fails with `java.lang.NoClassDefFoundError`

If a Hive synchronization job fails with java.lang.NoClassDefFoundError, ensure that the required dependency JARs are available in:

seatunnel/apache-seatunnel-2.3.11/lib

Required dependencies:

hive-exec-3.1.3.jar
hive-metastore-3.1.3.jar
libfb303-0.9.3.jar

Hive Synchronization Job Completes Successfully but No Data Is Written

If the synchronization job completes successfully but no data is written to Hive, verify that the Hive warehouse directory is mounted correctly in docker-compose.yml.

volumes:
  # Mount the Hive warehouse directory to ensure data is persisted on the host
  - ./hive-warehouse:/opt/hive/data/warehouse

Check Which Worker Executes the Job

You can identify which worker node is executing the synchronization job by reviewing the master log:

./logs/master/seatunnel-engine-master.log

Example:

Task [TaskGroupLocation{jobId=1080750681855361026, pipelineId=1, taskGroupId=2}] will be executed on worker [[seatunnel-worker2]:5801], slotID [2], resourceProfile [ResourceProfile{cpu=CPU{core=0}, heapMemory=Memory{bytes=0}}], sequence [db6b679c-67cc-43b8-b64a-acaa85c2a4c0], assigned [1080750681855361026]