Apache SeaTunnel 2.3.13 is about to be released. As an important transitional version, it significantly improves the stability of the core engine, further rounds out its CDC capabilities, and takes a key step into the AI ETL domain.
Based on an in-depth analysis of the 2.3.13-release branch source code, we summarize the key updates in this upcoming version.
Key Highlights
1. Core Engine: Flink Schema Evolution and Zeta Stability
Flink Engine Supports CDC Schema Evolution (#9867)
This is a long-awaited feature for Apache Flink users. Version 2.3.13 officially implements automatic propagation and adaptation of source-side schema changes (DDL) at the Flink engine layer. This bridges the final gap from CDC Source to the Flink Engine, enabling Flink tasks to handle upstream schema changes as smoothly as the Zeta engine does.
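As a minimal sketch of how this could look in practice, the job below enables DDL capture on a MySQL CDC source so schema changes flow through the Flink engine. The connection details are hypothetical, and the `schema-changes.enabled` option name follows the existing MySQL-CDC connector documentation; verify it against the official 2.3.13 connector docs.

```hocon
env {
  parallelism = 1
  job.mode = "STREAMING"
}

source {
  MySQL-CDC {
    # Hypothetical connection details, for illustration only
    url = "jdbc:mysql://localhost:3306/demo"
    username = "st_user"
    password = "st_pass"
    table-names = ["demo.orders"]
    # Capture upstream DDL so schema changes propagate downstream
    schema-changes.enabled = true
  }
}

sink {
  # Any sink that supports schema evolution can be used here
  Console {}
}
```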
Deep Optimization of the Zeta Engine
- Remote Pagination Query Support (#9951): Significantly improves the response speed and user experience of SeaTunnel UI and REST API in large-scale task scenarios.
- Memory Leak Fix (#10315): Fixes a memory leak when canceling suspended tasks, improving the stability of long-running clusters.
- Multi-Sink Metrics Fix (#10376): Resolves inaccurate `Write Count` display in multi-destination write scenarios.
2. AI ETL: Embracing Unstructured Data
Multimodal Embedding Transform (#9673)
A new Multimodal Embedding transform enables vectorization of text and image data. Combined with Markdown parsing, SeaTunnel can now directly build a full RAG (Retrieval-Augmented Generation) data pipeline from “unstructured documents” to “vector databases”.
Elasticsearch Vector Optimization (#10260)
Improves vector parameter support in the Elasticsearch Sink, making it better suited for AI vector storage scenarios.
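A rough sketch of wiring the two features together is shown below. The Embedding transform's parameter names mirror the commented demo later in this post; the Elasticsearch sink's `hosts` and `index` options are standard, but the exact vector-mapping options introduced by #10260 are not assumed here, so consult the release docs before use.

```hocon
transform {
  # Parameter names follow the demo configuration later in this post
  Embedding {
    source_table_name = "cleaned_table"
    result_table_name = "vector_table"
    vector_field = "vector"
    model_provider = "openai"
    api_key = "${OPENAI_API_KEY}"
  }
}

sink {
  # Writes rows (including the vector field) into an ES index
  Elasticsearch {
    source_table_name = "vector_table"
    hosts = ["https://localhost:9200"]
    index = "knowledge_base"
  }
}
```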
3. Connector Ecosystem: Multi-table Sync and Type Enhancements
- MongoDB: Significantly enhances multi-table synchronization mode and standardizes schema configuration parameters for non-relational data sources (#10370).
- HBase: The Sink now supports `DATE`, `TIME`, `TIMESTAMP`, and `DECIMAL` types, and fixes Decimal deserialization issues (#10291).
- Hive: Supports configuring multiple Metastore URIs for automatic failover (#10253), and introduces Socket/Connection timeout controls (#10254).
- JDBC / Redshift: Upgrades driver versions to address OOM issues and fixes integer overflow bugs when merging schemas with large fields.
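For the Hive failover feature, a plausible configuration is sketched below. The comma-separated URI syntax is an assumption based on the Hive Metastore client's own convention of accepting multiple thrift URIs; the table name is hypothetical.

```hocon
source {
  Hive {
    table_name = "default.orders"
    # Multiple Metastore URIs (assumed comma-separated syntax):
    # the client fails over to the next URI if one is unreachable
    metastore_uri = "thrift://metastore-1:9083,thrift://metastore-2:9083"
  }
}
```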
Key Fixes and Optimizations
This release fixes several critical bugs that may affect production stability. Users running high-load environments should pay special attention.
| Component | Type | Issue | Impact |
|---|---|---|---|
| Core | Hang | FakeSource may hang after restore because NoMoreSplits is not sent (#10275) | High: Resolves tasks that cannot finish in specific scenarios |
| ClickHouse | Leak | Fixes ThreadLocal memory leak in ClickhouseCatalogUtil (#10264) | High: Prevents off-heap memory overflow in long-running services |
| Redshift | OOM | Upgrades JDBC driver to resolve OOM during large-scale data reads (#10393) | Medium: Improves Redshift synchronization stability |
| HBase | NPE | Fixes NullPointerException when reading empty tables (#10336) | Medium: Improves robustness in edge cases |
| SSH | Crash | Upgrades jsch library to fix buffer issues (#10298) | Medium: Improves SFTP/SSH connection stability |
Deep Dive: Building an AI Knowledge Base Data Pipeline
A hidden core theme in 2.3.13 is “Unstructured Data to Vector.” The following demo illustrates how the new features can parse a local Markdown knowledge base and synchronize it to a vector storage system (Console output used as an example).
Scenario
Read technical documentation (Markdown) from a local directory, parse it into structured data by sections, and prepare it for embedding generation.
Configuration File (Demo)
```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  LocalFile {
    path = "/data/knowledge_base"
    file_format_type = "markdown"
    parse_strategy = {
      schema = [
        {name = "doc_name", type = "string"},
        {name = "heading", type = "string"},
        {name = "content", type = "string"},
        {name = "code_block", type = "string"}
      ]
    }
  }
}

transform {
  Replace {
    source_table_name = "source_table"
    result_table_name = "cleaned_table"
    replace_field = "content"
    pattern = "\\n+"
    replacement = " "
  }

  # AI Embedding example (optional)
  # Embedding {
  #   source_table_name = "cleaned_table"
  #   result_table_name = "vector_table"
  #   vector_field = "vector"
  #   model_provider = "openai"
  #   api_key = "${OPENAI_API_KEY}"
  # }
}

sink {
  Console {
    source_table_name = "cleaned_table"
  }
}
```
Source Code Highlights
Markdown Parsing Core: `MarkdownReadStrategy.java`
This class uses the `flexmark-java` library to traverse the Markdown AST and convert unstructured text into `SeaTunnelRow` structures.
Schema Evolution Adaptation: `FlinkRowConverter.java`
Adds compatibility logic for dynamic schema changes in the Flink translation layer.
Conclusion
Apache SeaTunnel 2.3.13 continues its rapid iteration while clearly increasing investment in stability (bug fixes) and emerging scenarios (AI and CDC).
Whether addressing CDC challenges for Apache Flink users or supporting AI engineers in handling unstructured data pipelines, this version provides meaningful improvements.
Note: This analysis is based on the 2.3.13-release branch (commit e4052e95c). For the final release content, refer to the official Release Notes.