Apache SeaTunnel 2.3.13 is about to be released. As an important transitional version, it significantly improves the stability of the core engine, further rounds out its CDC capabilities, and takes a key step into the AI ETL domain.
Based on an in-depth analysis of the 2.3.13-release branch source code, we summarize the key updates in this upcoming version.
Key Highlights
1. Core Engine: Flink Schema Evolution and Zeta Stability
Flink Engine Supports CDC Schema Evolution (#9867)
This is a long-awaited feature for Apache Flink users. Version 2.3.13 officially implements automatic propagation and adaptation of source-side schema changes (DDL) at the Flink engine layer. This bridges the final gap from CDC Source to the Flink Engine, enabling Flink tasks to handle upstream schema changes as smoothly as the Zeta engine does.
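As a minimal sketch of how this could look in practice, the job below enables DDL capture on a MySQL CDC source so schema changes flow through the Flink engine. The connection details are hypothetical, and the `schema-changes.enabled` option name follows the existing MySQL-CDC connector documentation; verify it against the official 2.3.13 connector docs.

```hocon
env {
  parallelism = 1
  job.mode = "STREAMING"
}

source {
  MySQL-CDC {
    # Hypothetical connection details, for illustration only
    url = "jdbc:mysql://localhost:3306/demo"
    username = "st_user"
    password = "st_pass"
    table-names = ["demo.orders"]
    # Capture upstream DDL so schema changes propagate downstream
    schema-changes.enabled = true
  }
}

sink {
  # Any sink that supports schema evolution can be used here
  Console {}
}
```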
Deep Optimization of the Zeta Engine
- Remote Pagination Query Support (#9951): Significantly improves the response speed and user experience of SeaTunnel UI and REST API in large-scale task scenarios.
- Memory Leak Fix (#10315): Fixes a memory leak when canceling suspended tasks, improving the stability of long-running clusters.
- Multi-Sink Metrics Fix (#10376): Resolves inaccurate `Write Count` display in multi-destination write scenarios.
2. AI ETL: Embracing Unstructured Data
Multimodal Embedding Transform (#9673)
A new Multimodal Embedding transform enables vectorization of text and image data. Combined with Markdown parsing, SeaTunnel can now directly build a full RAG (Retrieval-Augmented Generation) data pipeline from “unstructured documents” to “vector databases”.
Elasticsearch Vector Optimization (#10260)
Improves vector parameter support in the Elasticsearch Sink, making it better suited for AI vector storage scenarios.
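A rough sketch of wiring the two features together is shown below. The Embedding transform's parameter names mirror the commented demo later in this post; the Elasticsearch sink's `hosts` and `index` options are standard, but the exact vector-mapping options introduced by #10260 are not assumed here, so consult the release docs before use.

```hocon
transform {
  # Parameter names follow the demo configuration later in this post
  Embedding {
    source_table_name = "cleaned_table"
    result_table_name = "vector_table"
    vector_field = "vector"
    model_provider = "openai"
    api_key = "${OPENAI_API_KEY}"
  }
}

sink {
  # Writes rows (including the vector field) into an ES index
  Elasticsearch {
    source_table_name = "vector_table"
    hosts = ["https://localhost:9200"]
    index = "knowledge_base"
  }
}
```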
3. Connector Ecosystem: Multi-table Sync and Type Enhancements
- MongoDB: Significantly enhances multi-table synchronization mode and standardizes schema configuration parameters for non-relational data sources (#10370).
- HBase: The Sink now supports `DATE`, `TIME`, `TIMESTAMP`, and `DECIMAL` types, and fixes Decimal deserialization issues (#10291).
- Hive: Supports configuring multiple Metastore URIs for automatic failover (#10253), and introduces Socket/Connection timeout controls (#10254).
- JDBC / Redshift: Upgrades driver versions to address OOM issues and fixes integer overflow bugs when merging schemas with large fields.
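For the Hive failover feature, a plausible configuration is sketched below. The comma-separated URI syntax is an assumption based on the Hive Metastore client's own convention of accepting multiple thrift URIs; the table name is hypothetical.

```hocon
source {
  Hive {
    table_name = "default.orders"
    # Multiple Metastore URIs (assumed comma-separated syntax):
    # the client fails over to the next URI if one is unreachable
    metastore_uri = "thrift://metastore-1:9083,thrift://metastore-2:9083"
  }
}
```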
Key Fixes and Optimizations
This release fixes several critical bugs that may affect production stability. Users running high-load environments should pay special attention.
| Component | Type | Issue | Impact |
|---|---|---|---|
| Core | Hang | FakeSource may hang after restore because NoMoreSplits is not sent (#10275) | High: Resolves tasks that cannot finish in specific scenarios |
| ClickHouse | Leak | Fixes ThreadLocal memory leak in ClickhouseCatalogUtil (#10264) | High: Prevents off-heap memory overflow in long-running services |
| Redshift | OOM | Upgrades JDBC driver to resolve OOM during large-scale data reads (#10393) | Medium: Improves Redshift synchronization stability |
| HBase | NPE | Fixes NullPointerException when reading empty tables (#10336) | Medium: Improves robustness in edge cases |
| SSH | Crash | Upgrades jsch library to fix buffer issues (#10298) | Medium: Improves SFTP/SSH connection stability |
Deep Dive: Building an AI Knowledge Base Data Pipeline
A hidden core theme in 2.3.13 is “Unstructured Data to Vector.” The following demo illustrates how the new features can parse a local Markdown knowledge base and synchronize it to a vector storage system (Console output used as an example).
Scenario
Read technical documentation (Markdown) from a local directory, parse it into structured data by sections, and prepare it for embedding generation.
Configuration File (Demo)
```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  LocalFile {
    path = "/data/knowledge_base"
    file_format_type = "markdown"
    parse_strategy = {
      schema = [
        {name = "doc_name", type = "string"},
        {name = "heading", type = "string"},
        {name = "content", type = "string"},
        {name = "code_block", type = "string"}
      ]
    }
  }
}

transform {
  Replace {
    source_table_name = "source_table"
    result_table_name = "cleaned_table"
    replace_field = "content"
    pattern = "\\n+"
    replacement = " "
  }

  # AI Embedding example (optional)
  # Embedding {
  #   source_table_name = "cleaned_table"
  #   result_table_name = "vector_table"
  #   vector_field = "vector"
  #   model_provider = "openai"
  #   api_key = "${OPENAI_API_KEY}"
  # }
}

sink {
  Console {
    source_table_name = "cleaned_table"
  }
}
```
Source Code Highlights
Markdown Parsing Core: `MarkdownReadStrategy.java`
This class uses the `flexmark-java` library to traverse the Markdown AST and convert unstructured text into `SeaTunnelRow` structures.
Schema Evolution Adaptation: `FlinkRowConverter.java`
Adds compatibility logic for dynamic schema changes in the Flink translation layer.
Conclusion
Apache SeaTunnel 2.3.13 continues its rapid iteration while clearly increasing investment in stability (bug fixes) and emerging scenarios (AI and CDC).
Whether addressing CDC challenges for Apache Flink users or supporting AI engineers in handling unstructured data pipelines, this version provides meaningful improvements.
Note: This analysis is based on the 2.3.13-release branch (commit e4052e95c). For the final release content, refer to the official Release Notes.